[OMNIML-4672] training_support#1477
Conversation
📝 Walkthrough

Adds a new YAML pipeline at tools/launcher/examples/Qwen/Qwen3-8B/step3_train.yaml defining a Qwen3-8B_EAGLE3_train job with a global hf_model path, one training task invoking common/eagle3/train_eagle.sh with offline EAGLE3 args, and SLURM/container runtime settings.

Changes: Qwen3-8B EAGLE3 Training Configuration
🎯 1 (Trivial) | ⏱️ ~3 minutes
🚥 Pre-merge checks: ✅ 5 passed | ❌ 1 failed (1 inconclusive)
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tools/launcher/examples/Qwen/Qwen3-8B/step3_train.yaml`:
- Around line 11-21: Add an environment block to task_0 that sets the required
model env vars: include MLM_MODEL_CFG with the HuggingFace repo ID for this
model and QUANT_CFG with the chosen quantization config (e.g., NVFP4_DEFAULT_CFG
or INT8_DEFAULT_CFG); ensure the environment uses the project-required
list-of-single-key-dicts format (each env var as its own single-key mapping) so
tools/launcher parsing and downstream scripts like common/eagle3/train_eagle.sh
can read MLM_MODEL_CFG and QUANT_CFG.
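The "list-of-single-key-dicts" format the prompt refers to can be sketched as follows. This is a minimal illustration only: `flatten_env` is a hypothetical helper, not the actual tools/launcher parser, and the values mirror the review's suggested fix.

```python
# Each env var is its own single-key mapping inside a list -- the shape
# the tools/launcher coding guidelines require. Values below come from
# the review's suggested patch for this config.
environment = [
    {"MLM_MODEL_CFG": "Qwen/Qwen3-8B"},
    {"QUANT_CFG": "NVFP4_DEFAULT_CFG"},
]

def flatten_env(entries):
    """Collapse a list of single-key dicts into one {name: value} dict.

    Illustrative only; the real launcher's parsing code is not shown in
    this PR, so this helper is an assumption about the general idea.
    """
    flat = {}
    for entry in entries:
        if len(entry) != 1:
            raise ValueError(f"each env entry must have exactly one key: {entry!r}")
        (name, value), = entry.items()
        flat[name] = value
    return flat

env = flatten_env(environment)
print(env["MLM_MODEL_CFG"])  # Qwen/Qwen3-8B
print(env["QUANT_CFG"])      # NVFP4_DEFAULT_CFG
```

Keeping one key per list entry preserves ordering and makes malformed entries (two keys in one mapping) easy to reject at parse time.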
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 2dac94de-3121-43ac-acaa-ec0b95ccc86e
📒 Files selected for processing (1)
tools/launcher/examples/Qwen/Qwen3-8B/step3_train.yaml
```yaml
task_0:
  script: common/eagle3/train_eagle.sh
  args:
    - --config modules/Model-Optimizer/modelopt_recipes/general/speculative_decoding/eagle3.yaml
    - model.model_name_or_path=<<global_vars.hf_model>>
    - data.offline_data_path=/scratchspace/offline_hidden_states
    - training.output_dir=/scratchspace/eagle3
    - training.training_seq_len=4096
    - training.disable_tqdm=true
    - training.ar_validate_steps=500000
  slurm_config:
```
Add required model environment variables for this new config.
task_0 is missing the required environment block with MLM_MODEL_CFG (HF repo ID) and QUANT_CFG, and the env format should be list-of-single-key-dicts.
Suggested patch:

```diff
 task_0:
   script: common/eagle3/train_eagle.sh
+  environment:
+    - MLM_MODEL_CFG: Qwen/Qwen3-8B
+    - QUANT_CFG: NVFP4_DEFAULT_CFG
   args:
     - --config modules/Model-Optimizer/modelopt_recipes/general/speculative_decoding/eagle3.yaml
     - model.model_name_or_path=<<global_vars.hf_model>>
```

As per coding guidelines, tools/launcher/**/*.yaml requires "environment as list-of-single-key-dicts", "Set MLM_MODEL_CFG environment variable to the HuggingFace repo ID when adding a new model config", and "Set QUANT_CFG environment variable (e.g., NVFP4_DEFAULT_CFG, INT8_DEFAULT_CFG) when adding a new model config".
Agent-authored via pensieve-intern's training_support stage on Epic OMNIML-4666. Faithful extraction of task_2 (EAGLE3 draft-head training) from hf_offline_eagle3.yaml's monolithic pipeline, renamed to task_0 per the standalone-step convention.

Signed-off-by: Chenhan D. Yu <chenhany@nvidia.com>
Force-pushed 341c8fa to 4300a1d.
Codecov Report

✅ All modified and coverable lines are covered by tests.

```
@@            Coverage Diff             @@
##             main    #1477      +/-   ##
==========================================
- Coverage   76.78%   76.78%   -0.01%
==========================================
  Files         473      473
  Lines       51413    51413
==========================================
- Hits        39476    39475       -1
- Misses      11937    11938       +1
```

Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
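The one-line shift from Hits to Misses is too small to move the two-decimal percentage, which is why both columns read 76.78% while the delta shows -0.01%. A quick sketch of the arithmetic, using only the numbers from the table (how Codecov rounds the delta for display is an assumption here, not taken from its documentation):

```python
# Recompute the coverage figures from the Codecov table above.
total_lines = 51413
hits_before, hits_after = 39476, 39475  # one covered line became a miss

pct_before = hits_before / total_lines * 100
pct_after = hits_after / total_lines * 100

# Both round to the same two-decimal display value...
print(f"{pct_before:.2f} {pct_after:.2f}")  # 76.78 76.78

# ...while the true delta is about -0.0019 percentage points, which the
# report surfaces as "-0.01%" (display rounding assumed).
print(f"{pct_after - pct_before:.4f}")  # -0.0019
```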
Closing — wrong artifact shape. Each pensieve-intern support stage was authoring a separate
Pensieve-intern v0.33.33 will land this redesign —
Draft PR opened by pensieve-intern for OMNIML-4672.
Stage `training_support` of Epic OMNIML-4666. The agent ran from the SPEC on the ticket description; review every change before marking ready. Always-draft is enforced — the bot never auto-merges.