EAGLE3 new model support: pipeline configs, triage docs, and Ministral-3 fixes#1417
Conversation
Add hf_offline_eagle3.yaml configs in tools/launcher/examples/ for: - DeepSeek/DeepSeek-V3.2 (685B MoE, 2 nodes, TP=8) - GLM/GLM-5 (744B MoE, 2 nodes bench, TP=4/EP=2) - MiniMax/MiniMax-M2.5 (230B MoE, TP=4/EP=4) - Mistral/Ministral-3-8B (8B dense, TP=4) - Mistral/Ministral-3-14B (14B dense, TP=4) - MoonshotAI/Kimi-K2.5 (1T MoE, TP=4/EP=1) - NVIDIA/Kimi-K2.5-NVFP4 (NVFP4 quant; tasks 1-2 use BF16 base) - OpenAI/GPT-OSS-20B (20B dense, TP=4) - Qwen/Qwen3.5-27B (27B dense VLM, TP=4) - Qwen/Qwen3.5-35B-A3B (35B MoE, TP=4/EP=4) - Qwen/Qwen3.5-9B (9B dense VLM, TP=4) - StepFun/Step-3.5-Flash (197B MoE with SWA, TP=4/EP=4) Each config follows the standard 4-step EAGLE3 offline pipeline: query → dump hidden states → train draft head → benchmark. Uses public slurm_factory and common/ script paths. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Ye Yu <yeyu@nvidia.com>
Adds tools/launcher/examples/EAGLE3_TRIAGE.md, a living document for tracking EAGLE3 pipeline failure modes across new models. Content: - Mermaid decision-tree diagram mapping each pipeline step (query → dump → train → benchmark) to known failure modes and root causes - Model test result table with pass/fail/timeout status for 7 models tested on the OCI-HSG cluster (3 remaining to be tested) - 6 documented issues with symptoms, affected models, root causes, and fix recommendations: 1. dump_offline_data_vllm.sh missing (universal, all models) 2. offline_training.sh HF Hub upload bug (universal, all models) 3. Task 0 time limit exceeded (5 models) 4. GPT-OSS-20B tiktoken cache missing (model-specific) 5. trust_remote_code not passed to benchmark (MiniMax-M2.5) 6. DeepSeek-V3.2 task_1 OOM (model-specific) - Update instructions for adding new models or failure modes Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Ye Yu <yeyu@nvidia.com>
From the nmm-okr30 sandbox MR (b838b171): - Add common/eagle3/dump_offline_data_hf.sh: HF-based (device_map=auto) hidden state extraction for models not supported by TRT-LLM. Handles VLMs, custom-code models, and architectures absent from TRT-LLM. - Update task_1 for 8 models to use dump_offline_data_hf.sh: MiniMax-M2.5, Ministral-3-8B, Ministral-3-14B, GPT-OSS-20B, Qwen3.5-9B, Qwen3.5-27B, Qwen3.5-35B-A3B, Step-3.5-Flash. Models that retain TRT-LLM dump: DeepSeek-V3.2, GLM-5, Kimi-K2.5, Kimi-K2.5-NVFP4 (all pure-text MoE with TRT-LLM support). - Update EAGLE3_TRIAGE.md with actual test results from 7 models run on OCI-HSG cluster on 2026-04-15, marking Issue 2 (HF upload bug) as FIXED and correcting Issue 1 status (hf script created as fix). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Ye Yu <yeyu@nvidia.com>
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds HF- and vLLM-based hidden-state dumping launchers and a vLLM Python extractor, many per-model HF pipeline YAMLs chaining data synthesis → hidden-state dump → EAGLE3 draft-head training → benchmark, triage/diagnostics docs, and a scheduler tweak for allow-to-fail pipelines. ChangesEAGLE3 HuggingFace Offline Pipeline
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Important Pre-merge checks failedPlease resolve all errors before merging. Addressing warnings is optional. ❌ Failed checks (1 error, 1 warning)
✅ Passed checks (4 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 20
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tools/launcher/common/eagle3/dump_offline_data_hf.sh`:
- Around line 50-52: Add an explicit guard that checks the HF_MODEL_CKPT
environment variable is set and non-empty before running the Python command: if
HF_MODEL_CKPT is missing or empty, print a clear error mentioning HF_MODEL_CKPT
and exit with a non-zero status so the subsequent call to
compute_hidden_states_hf.py (the python3 invocation that uses --model
${HF_MODEL_CKPT} and --dp-rank ${TASK_ID}) never runs; keep the check
immediately above the python3 invocation.
- Line 34: The pip installation currently suppresses errors using "pip install
datasets 2>/dev/null || true" which hides failures while the script imports
datasets (at the top) and later calls load_dataset("json", ...); remove the
error suppression so installation failures are visible by replacing that
invocation with a plain failing install (e.g., "pip install datasets") or ensure
the environment pre-installs the dependency; update the launcher script around
the pip install line and any documentation/bootstrap steps accordingly so
ImportError/debug info surfaces immediately.
- Around line 36-48: The script uses unsafe parameter expansion and masks
errors; update checks to use quoted defaults like [ -z
"${SLURM_ARRAY_TASK_ID:-}" ] and [ -z "${SLURM_ARRAY_TASK_COUNT:-}" ] and assign
TASK_ID and TASK_COUNT from the quoted values (e.g.,
TASK_ID="${SLURM_ARRAY_TASK_ID}" ) to avoid "unary operator expected" errors;
forward arguments using "$@" instead of ${@} to preserve multi-word args; stop
silencing pip failures by removing `2>/dev/null || true` from the `pip install
datasets` invocation so installs surface errors; and add a validation check for
HF_MODEL_CKPT (e.g., [ -n "${HF_MODEL_CKPT:-}" ] with an explanatory error and
exit) before it is used.
In `@tools/launcher/examples/DeepSeek/DeepSeek-V3.2/hf_offline_eagle3.yaml`:
- Around line 33-34: The YAML global_vars block that defines hf_model (hf_model:
/hf-local/deepseek-ai/DeepSeek-V3.2) is missing the required environment
variables; set MLM_MODEL_CFG to the HuggingFace repo ID for this model (e.g.,
deepseek-ai/DeepSeek-V3.2 or the canonical repo string) and set QUANT_CFG to the
chosen quantization profile (e.g., NVFP4_DEFAULT_CFG or INT8_DEFAULT_CFG)
alongside hf_model in the same global_vars section so the launcher picks up
MLM_MODEL_CFG and QUANT_CFG for this config (apply same change for the other
blocks noted around lines 46-47, 64-66, 104-106).
In `@tools/launcher/examples/EAGLE3_TRIAGE.md`:
- Around line 11-37: Several fenced code blocks in EAGLE3_TRIAGE.md are missing
language tags and required surrounding blank lines; update the fences around the
"Model checkpoint (HuggingFace)" block, the "ValueError: The repository ..."
block, and the "Per-model results template" block to use explicit language
markers (e.g., ```text or ```markdown) and ensure there is a blank line before
and after each fence so markdownlint stops warning; specifically edit the fences
surrounding the literal text "Model checkpoint (HuggingFace)", the error snippet
starting "ValueError: The repository ... contains custom code...", and the
"Per-model results template" section to add the appropriate ```text/```markdown
tags and blank lines.
In `@tools/launcher/examples/GLM/GLM-5/hf_offline_eagle3.yaml`:
- Around line 33-35: The YAML config for the new GLM-5 model sets hf_model but
omits required environment wiring for MLM_MODEL_CFG and QUANT_CFG; update the
global_vars block to include MLM_MODEL_CFG with the HuggingFace repo id
(matching hf_model value) and a default QUANT_CFG (e.g., NVFP4_DEFAULT_CFG or
INT8_DEFAULT_CFG), and ensure each task's environment entries (the task-level
environment maps referenced in this file) also set MLM_MODEL_CFG and QUANT_CFG
so downstream launcher logic receives consistent model metadata.
- Around line 94-103: The task's args block is missing the required
--trust-remote-code flag which prevents models that use custom model/tokenizer
classes from loading; update the args list (the args block shown with
--speculative_algorithm EAGLE3 and --mtbench ...) to include --trust-remote-code
true (or just --trust-remote-code if CLI accepts flag style) so task_3 forwards
trust to remote code when launching the model.
In `@tools/launcher/examples/MiniMax/MiniMax-M2.5/hf_offline_eagle3.yaml`:
- Around line 31-32: The model config under global_vars only sets hf_model; add
the required environment variables MLM_MODEL_CFG and QUANT_CFG alongside
hf_model: set MLM_MODEL_CFG to the HuggingFace repo ID for this model (e.g.,
"MiniMaxAI/MiniMax-M2.5") and set QUANT_CFG to the chosen quantization profile
(e.g., "NVFP4_DEFAULT_CFG" or "INT8_DEFAULT_CFG"); update each affected
global_vars block (the one containing hf_model and the other similar blocks
referenced) so every launcher job has both MLM_MODEL_CFG and QUANT_CFG defined.
- Around line 89-105: task_3 in the YAML (the specdec benchmark using script
common/specdec_bench/quick_check.sh) is missing the required --trust-remote-code
flag for the MiniMax custom architecture; update the task_3 args list to include
- --trust-remote-code alongside the other flags (e.g., with
--speculative_algorithm EAGLE3) so the benchmark runs with trust_remote_code
enabled.
In `@tools/launcher/examples/Mistral/Ministral-3-14B/hf_offline_eagle3.yaml`:
- Around line 28-30: The new model block using global_vars hf_model:
/hf-local/mistralai/Ministral-3-14B-Instruct-2512-BF16 is missing the required
environment metadata; add MLM_MODEL_CFG and QUANT_CFG entries alongside hf_model
so launcher jobs get consistent model-id and quant-config metadata — set
MLM_MODEL_CFG to the HuggingFace repo ID for MinistraI-3-14B (the canonical repo
string) and set QUANT_CFG to the appropriate quantization constant (e.g.,
NVFP4_DEFAULT_CFG or INT8_DEFAULT_CFG) in the same global_vars section; apply
the same addition for the other mentioned blocks (lines ~41-43, 59-60, 99-101)
where hf_model entries exist.
In `@tools/launcher/examples/Mistral/Ministral-3-8B/hf_offline_eagle3.yaml`:
- Around line 27-29: The new model YAML sets global_vars.hf_model but is missing
the required environment vars MLM_MODEL_CFG and QUANT_CFG; update the
global_vars block in hf_offline_eagle3.yaml (and the other instances referenced
around lines 40-42, 58-60, 98-100) to add MLM_MODEL_CFG with the HuggingFace
repo ID for the model and QUANT_CFG with the appropriate quantization profile
(e.g., NVFP4_DEFAULT_CFG or INT8_DEFAULT_CFG) so both variables are present
alongside hf_model for the new model config.
In `@tools/launcher/examples/MoonshotAI/Kimi-K2.5/hf_offline_eagle3.yaml`:
- Around line 34-35: Add the missing required environment metadata by setting
MLM_MODEL_CFG to the HuggingFace repo ID and QUANT_CFG to the chosen
quantization profile for each model block; specifically update the global_vars
block containing hf_model (/hf-local/moonshotai/Kimi-K2.5) to include
MLM_MODEL_CFG: moonshotai/Kimi-K2.5 and QUANT_CFG: <choose e.g.,
NVFP4_DEFAULT_CFG or INT8_DEFAULT_CFG>, and make the same additions to the other
model blocks referenced (around the lines noted: 47-48, 65-66, 105-107) so every
model config includes both MLM_MODEL_CFG and QUANT_CFG keys.
In `@tools/launcher/examples/NVIDIA/Kimi-K2.5-NVFP4/hf_offline_eagle3.yaml`:
- Around line 40-43: global_vars block for this model only defines hf_model and
hf_model_bf16 but is missing the required environment variables; add
MLM_MODEL_CFG and QUANT_CFG entries (e.g., MLM_MODEL_CFG:
"nvidia/Kimi-K2.5-NVFP4" and QUANT_CFG: "NVFP4_DEFAULT_CFG" or the appropriate
quant config) to the model's task environment or shared vars so the launcher
conventions are satisfied; update the same pattern for the other model blocks
referenced (the additional hf_model/hf_model_bf16 groups) by wiring
MLM_MODEL_CFG and QUANT_CFG via task environments or shared/global vars with
interpolation.
- Around line 102-110: The benchmark args block for this Kimi model family must
include the trust flag to allow remote tokenizer code; update the args list (the
same args block used by task_3/benchmark step) to add "--trust-remote-code" (or
"--trust-remote-code true") so the benchmark initialization for the Kimi
tokenizer will not fail; locate the args sequence in the YAML (the block
containing --draft_model_dir, --draft_length, etc.) and append the
--trust-remote-code flag.
In `@tools/launcher/examples/OpenAI/GPT-OSS-20B/hf_offline_eagle3.yaml`:
- Around line 28-30: The YAML is missing required environment variables: set
MLM_MODEL_CFG to the HuggingFace repo ID and QUANT_CFG to the chosen
quantization config (e.g., NVFP4_DEFAULT_CFG or INT8_DEFAULT_CFG) alongside the
existing hf_model; add these keys under global_vars (or in the
shared/interpolated env for all tasks) and ensure each task environment block
that references hf_model also includes MLM_MODEL_CFG and QUANT_CFG so the
launcher can pick them up consistently.
- Around line 91-104: task_3's runtime invocation is missing the required
--trust-remote-code flag and the TIKTOKEN_RS_CACHE_DIR environment export;
update the args array (the block containing --speculative_algorithm EAGLE3,
--engine VLLM, etc.) to include --trust-remote-code, and add a
TIKTOKEN_RS_CACHE_DIR entry to the environment list (alongside HF_LOCAL and
HF_MODEL_CKPT) pointing to the intended cache dir (e.g., /tiktoken_rs_cache or
the project's cache path) so the model runs with the same runtime settings as
earlier tasks.
In `@tools/launcher/examples/Qwen/Qwen3.5-27B/hf_offline_eagle3.yaml`:
- Around line 24-26: The pipeline config's global_vars block defines hf_model
but is missing the required environment variables MLM_MODEL_CFG and QUANT_CFG;
update the global_vars section (where hf_model is set) to add MLM_MODEL_CFG with
the HuggingFace repo ID for this model and QUANT_CFG with the appropriate
quantization profile (e.g., NVFP4_DEFAULT_CFG or INT8_DEFAULT_CFG) so the entry
for Qwen3.5-27B conforms to launcher metadata conventions; apply the same
addition to the other occurrences noted (around lines 36-38, 54-56, 94-97) to
keep all model configs consistent.
In `@tools/launcher/examples/Qwen/Qwen3.5-35B-A3B/hf_offline_eagle3.yaml`:
- Around line 29-31: The YAML global_vars block defines hf_model but is missing
the required environment variables MLM_MODEL_CFG and QUANT_CFG; add
MLM_MODEL_CFG with the HuggingFace repo ID for this model and set QUANT_CFG to
the appropriate quant config (e.g., NVFP4_DEFAULT_CFG or INT8_DEFAULT_CFG) in
the same global_vars section and ensure any places referencing this model (the
blocks around the existing hf_model entries) propagate these two env vars so the
launcher receives both model metadata and quant config.
In `@tools/launcher/examples/Qwen/Qwen3.5-9B/hf_offline_eagle3.yaml`:
- Around line 24-25: Add the mandatory environment entries MLM_MODEL_CFG and
QUANT_CFG next to the existing global_vars hf_model definition (hf_model:
/hf-local/Qwen/Qwen3.5-9B): set MLM_MODEL_CFG to the HuggingFace repo ID for
this model and set QUANT_CFG to the chosen quantization profile (e.g.,
NVFP4_DEFAULT_CFG or INT8_DEFAULT_CFG). Repeat the same addition for the other
hf_model blocks present later in the file (the other occurrences of global_vars
with hf_model) so all model configs include both required env vars.
In `@tools/launcher/examples/StepFun/Step-3.5-Flash/hf_offline_eagle3.yaml`:
- Around line 34-35: The new model config only sets hf_model but omits the
required environment metadata; add MLM_MODEL_CFG with the HuggingFace repo ID
(e.g., the same value as hf_model or the canonical HF repo string) and set
QUANT_CFG to the appropriate quantization profile (e.g., NVFP4_DEFAULT_CFG or
INT8_DEFAULT_CFG) inside the same global_vars block so launcher code sees
standardized envs; apply the same pattern for the other model blocks in this
file that define hf_model (the other global_vars entries referenced in the
comment).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 76d008e0-290f-420a-9842-78a95a3f5988
📒 Files selected for processing (14)
tools/launcher/common/eagle3/dump_offline_data_hf.shtools/launcher/examples/DeepSeek/DeepSeek-V3.2/hf_offline_eagle3.yamltools/launcher/examples/EAGLE3_TRIAGE.mdtools/launcher/examples/GLM/GLM-5/hf_offline_eagle3.yamltools/launcher/examples/MiniMax/MiniMax-M2.5/hf_offline_eagle3.yamltools/launcher/examples/Mistral/Ministral-3-14B/hf_offline_eagle3.yamltools/launcher/examples/Mistral/Ministral-3-8B/hf_offline_eagle3.yamltools/launcher/examples/MoonshotAI/Kimi-K2.5/hf_offline_eagle3.yamltools/launcher/examples/NVIDIA/Kimi-K2.5-NVFP4/hf_offline_eagle3.yamltools/launcher/examples/OpenAI/GPT-OSS-20B/hf_offline_eagle3.yamltools/launcher/examples/Qwen/Qwen3.5-27B/hf_offline_eagle3.yamltools/launcher/examples/Qwen/Qwen3.5-35B-A3B/hf_offline_eagle3.yamltools/launcher/examples/Qwen/Qwen3.5-9B/hf_offline_eagle3.yamltools/launcher/examples/StepFun/Step-3.5-Flash/hf_offline_eagle3.yaml
| # --input-data, --output-dir, --max-seq-len, etc. | ||
| ################################################################################################### | ||
|
|
||
| pip install datasets 2>/dev/null || true |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
cat -n tools/launcher/common/eagle3/dump_offline_data_hf.shRepository: NVIDIA/Model-Optimizer
Length of output: 2458
🏁 Script executed:
find . -name "compute_hidden_states_hf.py" -type fRepository: NVIDIA/Model-Optimizer
Length of output: 148
🏁 Script executed:
cat -n ./examples/speculative_decoding/collect_hidden_states/compute_hidden_states_hf.py | head -100Repository: NVIDIA/Model-Optimizer
Length of output: 3844
🏁 Script executed:
grep -n "load_dataset" ./examples/speculative_decoding/collect_hidden_states/compute_hidden_states_hf.py | head -5Repository: NVIDIA/Model-Optimizer
Length of output: 230
Remove error suppression from pip install since datasets is a required dependency.
pip install datasets 2>/dev/null || true on line 34 masks installation failures completely. The Python script imports datasets at line 31 and uses it at lines 110+ (e.g., load_dataset("json", ...)). If installation fails silently, the job will fail later with an ImportError instead of revealing the actual installation error, making debugging difficult and behavior non-reproducible. Either ensure the dependency is pre-installed in the execution environment, or let the installation command fail visibly so errors are clear.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tools/launcher/common/eagle3/dump_offline_data_hf.sh` at line 34, The pip
installation currently suppresses errors using "pip install datasets 2>/dev/null
|| true" which hides failures while the script imports datasets (at the top) and
later calls load_dataset("json", ...); remove the error suppression so
installation failures are visible by replacing that invocation with a plain
failing install (e.g., "pip install datasets") or ensure the environment
pre-installs the dependency; update the launcher script around the pip install
line and any documentation/bootstrap steps accordingly so ImportError/debug info
surfaces immediately.
| if [ -z ${SLURM_ARRAY_TASK_ID} ]; then | ||
| TASK_ID=0 | ||
| else | ||
| echo "SLURM_ARRAY_TASK_ID ${SLURM_ARRAY_TASK_ID}" | ||
| TASK_ID=${SLURM_ARRAY_TASK_ID} | ||
| fi | ||
|
|
||
| if [ -z ${SLURM_ARRAY_TASK_COUNT} ]; then | ||
| TASK_COUNT=1 | ||
| else | ||
| echo "SLURM_ARRAY_TASK_COUNT ${SLURM_ARRAY_TASK_COUNT}" | ||
| TASK_COUNT=${SLURM_ARRAY_TASK_COUNT} | ||
| fi |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
rg -n '\[ -z \${SLURM_ARRAY_TASK_(ID|COUNT)\} \]|\$\{@\}' tools/launcher/common/eagle3/dump_offline_data_hf.shRepository: NVIDIA/Model-Optimizer
Length of output: 207
🏁 Script executed:
cat -n tools/launcher/common/eagle3/dump_offline_data_hf.shRepository: NVIDIA/Model-Optimizer
Length of output: 2458
🏁 Script executed:
# Search for the patterns mentioned in the review with corrected regex
rg -n '\[ -z \$' tools/launcher/common/eagle3/dump_offline_data_hf.sh -A 2Repository: NVIDIA/Model-Optimizer
Length of output: 209
🏁 Script executed:
# Also search for ${@} pattern
rg -n '\$\{@\}' tools/launcher/common/eagle3/dump_offline_data_hf.shRepository: NVIDIA/Model-Optimizer
Length of output: 78
SLURM array defaults and argument forwarding have unsafe parameter expansion that will cause bash errors.
Unquoted [ -z ${...} ] will fail with "unary operator expected" when variables are unset. Unquoted ${@} breaks multi-word argument handling. Use [ -z "${VAR:-}" ] and "$@" to make this deterministic and safe.
Also, pip install datasets 2>/dev/null || true masks installation failures, and ${HF_MODEL_CKPT} is used without validation.
Proposed fix
-if [ -z ${SLURM_ARRAY_TASK_ID} ]; then
- TASK_ID=0
-else
- echo "SLURM_ARRAY_TASK_ID ${SLURM_ARRAY_TASK_ID}"
- TASK_ID=${SLURM_ARRAY_TASK_ID}
-fi
+if [ -z "${SLURM_ARRAY_TASK_ID:-}" ]; then
+ TASK_ID=0
+else
+ echo "SLURM_ARRAY_TASK_ID ${SLURM_ARRAY_TASK_ID}"
+ TASK_ID="${SLURM_ARRAY_TASK_ID}"
+fi
-if [ -z ${SLURM_ARRAY_TASK_COUNT} ]; then
- TASK_COUNT=1
-else
- echo "SLURM_ARRAY_TASK_COUNT ${SLURM_ARRAY_TASK_COUNT}"
- TASK_COUNT=${SLURM_ARRAY_TASK_COUNT}
-fi
+if [ -z "${SLURM_ARRAY_TASK_COUNT:-}" ]; then
+ TASK_COUNT=1
+else
+ echo "SLURM_ARRAY_TASK_COUNT ${SLURM_ARRAY_TASK_COUNT}"
+ TASK_COUNT="${SLURM_ARRAY_TASK_COUNT}"
+fi
python3 modules/Model-Optimizer/examples/speculative_decoding/collect_hidden_states/compute_hidden_states_hf.py \
@@
- ${@}
+ "$@"📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| if [ -z ${SLURM_ARRAY_TASK_ID} ]; then | |
| TASK_ID=0 | |
| else | |
| echo "SLURM_ARRAY_TASK_ID ${SLURM_ARRAY_TASK_ID}" | |
| TASK_ID=${SLURM_ARRAY_TASK_ID} | |
| fi | |
| if [ -z ${SLURM_ARRAY_TASK_COUNT} ]; then | |
| TASK_COUNT=1 | |
| else | |
| echo "SLURM_ARRAY_TASK_COUNT ${SLURM_ARRAY_TASK_COUNT}" | |
| TASK_COUNT=${SLURM_ARRAY_TASK_COUNT} | |
| fi | |
| if [ -z "${SLURM_ARRAY_TASK_ID:-}" ]; then | |
| TASK_ID=0 | |
| else | |
| echo "SLURM_ARRAY_TASK_ID ${SLURM_ARRAY_TASK_ID}" | |
| TASK_ID="${SLURM_ARRAY_TASK_ID}" | |
| fi | |
| if [ -z "${SLURM_ARRAY_TASK_COUNT:-}" ]; then | |
| TASK_COUNT=1 | |
| else | |
| echo "SLURM_ARRAY_TASK_COUNT ${SLURM_ARRAY_TASK_COUNT}" | |
| TASK_COUNT="${SLURM_ARRAY_TASK_COUNT}" | |
| fi |
🧰 Tools
🪛 Shellcheck (0.11.0)
[info] 36-36: Double quote to prevent globbing and word splitting.
(SC2086)
[info] 43-43: Double quote to prevent globbing and word splitting.
(SC2086)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tools/launcher/common/eagle3/dump_offline_data_hf.sh` around lines 36 - 48,
The script uses unsafe parameter expansion and masks errors; update checks to
use quoted defaults like [ -z "${SLURM_ARRAY_TASK_ID:-}" ] and [ -z
"${SLURM_ARRAY_TASK_COUNT:-}" ] and assign TASK_ID and TASK_COUNT from the
quoted values (e.g., TASK_ID="${SLURM_ARRAY_TASK_ID}" ) to avoid "unary operator
expected" errors; forward arguments using "$@" instead of ${@} to preserve
multi-word args; stop silencing pip failures by removing `2>/dev/null || true`
from the `pip install datasets` invocation so installs surface errors; and add a
validation check for HF_MODEL_CKPT (e.g., [ -n "${HF_MODEL_CKPT:-}" ] with an
explanatory error and exit) before it is used.
| python3 modules/Model-Optimizer/examples/speculative_decoding/collect_hidden_states/compute_hidden_states_hf.py \ | ||
| --model ${HF_MODEL_CKPT} \ | ||
| --dp-rank ${TASK_ID} \ |
There was a problem hiding this comment.
Validate HF_MODEL_CKPT before launching Python.
This script documents HF_MODEL_CKPT as required, but never checks it. Add an explicit guard so failures are immediate and actionable.
Proposed fix
+if [ -z "${HF_MODEL_CKPT:-}" ]; then
+ echo "HF_MODEL_CKPT is required"
+ exit 1
+fi
+
python3 modules/Model-Optimizer/examples/speculative_decoding/collect_hidden_states/compute_hidden_states_hf.py \📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| python3 modules/Model-Optimizer/examples/speculative_decoding/collect_hidden_states/compute_hidden_states_hf.py \ | |
| --model ${HF_MODEL_CKPT} \ | |
| --dp-rank ${TASK_ID} \ | |
| if [ -z "${HF_MODEL_CKPT:-}" ]; then | |
| echo "HF_MODEL_CKPT is required" | |
| exit 1 | |
| fi | |
| python3 modules/Model-Optimizer/examples/speculative_decoding/collect_hidden_states/compute_hidden_states_hf.py \ | |
| --model ${HF_MODEL_CKPT} \ | |
| --dp-rank ${TASK_ID} \ |
🧰 Tools
🪛 Shellcheck (0.11.0)
[info] 51-51: Double quote to prevent globbing and word splitting.
(SC2086)
[info] 52-52: Double quote to prevent globbing and word splitting.
(SC2086)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tools/launcher/common/eagle3/dump_offline_data_hf.sh` around lines 50 - 52,
Add an explicit guard that checks the HF_MODEL_CKPT environment variable is set
and non-empty before running the Python command: if HF_MODEL_CKPT is missing or
empty, print a clear error mentioning HF_MODEL_CKPT and exit with a non-zero
status so the subsequent call to compute_hidden_states_hf.py (the python3
invocation that uses --model ${HF_MODEL_CKPT} and --dp-rank ${TASK_ID}) never
runs; keep the check immediately above the python3 invocation.
| global_vars: | ||
| hf_model: /hf-local/deepseek-ai/DeepSeek-V3.2 |
There was a problem hiding this comment.
Include mandatory MLM_MODEL_CFG and QUANT_CFG env variables
This new model config does not set the required model repo-id and quant profile envs, which can break standardized launcher behavior across tasks.
As per coding guidelines, Set MLM_MODEL_CFG environment variable to the HuggingFace repo ID when adding a new model config and Set QUANT_CFG environment variable (e.g., NVFP4_DEFAULT_CFG, INT8_DEFAULT_CFG) when adding a new model config.
Also applies to: 46-47, 64-66, 104-106
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tools/launcher/examples/DeepSeek/DeepSeek-V3.2/hf_offline_eagle3.yaml` around
lines 33 - 34, The YAML global_vars block that defines hf_model (hf_model:
/hf-local/deepseek-ai/DeepSeek-V3.2) is missing the required environment
variables; set MLM_MODEL_CFG to the HuggingFace repo ID for this model (e.g.,
deepseek-ai/DeepSeek-V3.2 or the canonical repo string) and set QUANT_CFG to the
chosen quantization profile (e.g., NVFP4_DEFAULT_CFG or INT8_DEFAULT_CFG)
alongside hf_model in the same global_vars section so the launcher picks up
MLM_MODEL_CFG and QUANT_CFG for this config (apply same change for the other
blocks noted around lines 46-47, 64-66, 104-106).
There was a problem hiding this comment.
Note (applies to all the MLM_MODEL_CFG/QUANT_CFG comments across these YAMLs): these env vars belong to the megatron PTQ flow (common/megatron_lm/quantize/), not the EAGLE3 offline dump flow. The hf_offline_eagle3.yaml / eagle3_quick_check.yaml configs intentionally do not set them — the model is identified via HF_MODEL_CKPT and there's no quant-cfg selection step. Marking these as not applicable.
There was a problem hiding this comment.
@yeyu-nvidia, thanks for the clarification — that makes sense. MLM_MODEL_CFG and QUANT_CFG are megatron PTQ conventions and don't belong in the EAGLE3 offline dump flow, which uses HF_MODEL_CKPT instead. I'll record this to avoid the same false positive on future EAGLE3 configs.
🐇 📝
✏️ Learnings added
Learnt from: yeyu-nvidia
Repo: NVIDIA/Model-Optimizer PR: 1417
File: tools/launcher/examples/DeepSeek/DeepSeek-V3.2/hf_offline_eagle3.yaml:33-34
Timestamp: 2026-06-03T17:17:56.244Z
Learning: In NVIDIA/Model-Optimizer, `MLM_MODEL_CFG` and `QUANT_CFG` environment variables belong exclusively to the **megatron PTQ** flow (`tools/launcher/common/megatron_lm/quantize/`). Do NOT flag their absence in EAGLE3 offline pipeline configs (`hf_offline_eagle3.yaml`, `eagle3_quick_check.yaml`, etc.). In the EAGLE3 offline dump flow, the model is identified via `HF_MODEL_CKPT` and there is no quantization-config selection step — `MLM_MODEL_CFG` and `QUANT_CFG` are intentionally absent from these configs.
Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.
🧠 Learnings used
Learnt from: ChenhanYu
Repo: NVIDIA/Model-Optimizer PR: 1496
File: tools/launcher/examples/moonshotai/Kimi-K2.5/hf_offline_dflash.yaml:15-15
Timestamp: 2026-05-21T19:18:55.357Z
Learning: In `tools/launcher/examples/moonshotai/Kimi-K2.5/hf_offline_dflash.yaml`, the `global_vars.hf_model` path `/hf-local/moonshotai/Kimi-K2.6` is an intentional stand-in for Kimi-K2.5 weights that are not yet staged on cw-dfw. This is documented in the JIRA spec for OMNIML-4740 and the commit message body. Do not flag this as a mismatch during review until the K2.5 weights are confirmed staged.
Learnt from: h-guo18
Repo: NVIDIA/Model-Optimizer PR: 1052
File: modelopt/torch/speculative/utils.py:514-538
Timestamp: 2026-03-25T23:40:32.596Z
Learning: In NVIDIA/Model-Optimizer, Kimi-K2.5 (moonshotai/Kimi-K2.5) only supports the fake base model workflow for offline EAGLE training (i.e., `use_fake_base=True` is always set). It never exercises the non-fake-base offline training path in `load_vlm_or_llm` (modelopt/torch/speculative/utils.py). Do not flag VLM config unwrapping issues in that path as applicable to Kimi-K2.5.
Learnt from: CR
Repo: NVIDIA/Model-Optimizer PR: 0
File: tools/launcher/CLAUDE.md:0-0
Timestamp: 2026-03-18T22:32:46.809Z
Learning: Applies to tools/launcher/**/*.yaml : Set GPU and node counts in slurm_config based on model size when adding a new model config
Learnt from: CR
Repo: NVIDIA/Model-Optimizer PR: 0
File: tools/launcher/CLAUDE.md:0-0
Timestamp: 2026-03-18T22:32:46.809Z
Learning: Applies to tools/launcher/**/*.yaml : Set `MLM_MODEL_CFG` environment variable to the HuggingFace repo ID when adding a new model config
Learnt from: CR
Repo: NVIDIA/Model-Optimizer PR: 0
File: tools/launcher/CLAUDE.md:0-0
Timestamp: 2026-03-18T22:32:46.809Z
Learning: Applies to tools/launcher/**/*.yaml : Set `QUANT_CFG` environment variable (e.g., `NVFP4_DEFAULT_CFG`, `INT8_DEFAULT_CFG`) when adding a new model config
Learnt from: CR
Repo: NVIDIA/Model-Optimizer PR: 0
File: tools/launcher/CLAUDE.md:0-0
Timestamp: 2026-03-18T22:32:46.809Z
Learning: Applies to tools/launcher/examples/**/*.yaml : Create model configs in `examples/<Org>/<Model>/megatron_lm_ptq.yaml` following the standard YAML format for model quantization jobs
Learnt from: shengliangxu
Repo: NVIDIA/Model-Optimizer PR: 1423
File: modelopt_recipes/configs/ptq/presets/model/nvfp4_mlp_only.yaml:27-37
Timestamp: 2026-05-18T18:56:18.772Z
Learning: In `modelopt_recipes/configs/ptq/presets/model/nvfp4_mlp_only.yaml`, the overlapping quantizer patterns `*mlp*weight_quantizer` / `*mlp*input_quantizer` (inline) and `*.experts.*weight_quantizer` / `*.experts.*input_quantizer` (via imported `experts_nvfp4` unit) are intentional. This mirrors the original Python `_nvfp4_selective_quant_cfg(["*mlp*", "*block_sparse_moe*", "*.experts.*"])` behavior. The `experts_nvfp4` import is needed to cover expert layers NOT nested under `mlp` (e.g., `decoder.experts.X.weight_quantizer`). Both paths apply the same NVFP4 config so the overlap is idempotent and harmless.
Learnt from: realAsma
Repo: NVIDIA/Model-Optimizer PR: 953
File: modelopt/torch/quantization/algorithms.py:1302-1311
Timestamp: 2026-03-10T20:00:31.206Z
Learning: In NVIDIA/Model-Optimizer, the `"default"` key in auto_quantize recipe quant_cfg (e.g., FP8_DEFAULT_CFG, NVFP4_DEFAULT_CFG) is always `{"enable": False}`. In `get_auto_quantize_config` (modelopt/torch/quantization/algorithms.py), the global `"*": {"enable": False}` fallback already covers this, and quantizer-enabling patterns like `*weight_quantizer` and `*input_quantizer` match correctly via `fnmatch`. Do not flag `_match_quantizer_cfg` for missing "default" key handling in the auto_quantize context.
Learnt from: shengliangxu
Repo: NVIDIA/Model-Optimizer PR: 1423
File: modelopt_recipes/configs/ptq/presets/model/w4a8_mxfp4_fp8.yaml:25-25
Timestamp: 2026-05-18T18:55:54.450Z
Learning: In the NVIDIA/Model-Optimizer repository, MXFP-family quantization configs (`MXFP8_DEFAULT_CFG`, `MXFP6_DEFAULT_CFG`, `MXFP4_DEFAULT_CFG`, `MXINT8_DEFAULT_CFG`, `MXFP4_MLP_WEIGHT_ONLY_CFG`) and `W4A8_MXFP4_FP8_CFG` intentionally use `algorithm: None` (YAML null). Setting `algorithm: max` for these configs would change behavior. The YAML preset `modelopt_recipes/configs/ptq/presets/model/w4a8_mxfp4_fp8.yaml` correctly uses `algorithm:` (null) to match this convention.
Learnt from: benchislett
Repo: NVIDIA/Model-Optimizer PR: 1044
File: modelopt/torch/speculative/plugins/transformers.py:909-911
Timestamp: 2026-03-16T19:22:38.287Z
Learning: In NVIDIA/Model-Optimizer, `eagle_ttt_steps=0` is not a meaningful or valid configuration for EAGLE training. Do not flag potential issues (e.g., undefined loop variables) that only arise when `eagle_ttt_steps=0`, as this value is never expected to be used in practice.
Learnt from: h-guo18
Repo: NVIDIA/Model-Optimizer PR: 1328
File: modelopt/recipe/config.py:28-33
Timestamp: 2026-05-15T22:32:26.119Z
Learning: In NVIDIA/Model-Optimizer (`modelopt/recipe/config.py`), importing internal sibling modules such as `DFlashConfig`, `EagleConfig`, `MedusaConfig`, `SpecDataArgs`, `SpecModelArgs`, and `SpecTrainingArgs` at module level is intentional and correct. These classes are used as Pydantic `ModeloptField(default=..., validate_default=True)` values at class-definition time, so `TYPE_CHECKING` guards or lazy loading cannot be applied. Do not flag these as needing lazy imports in future reviews.
Learnt from: shengliangxu
Repo: NVIDIA/Model-Optimizer PR: 1506
File: modelopt_recipes/huggingface/mpt/ptq/w4a8_awq-kv_fp8_cast.yaml:39-42
Timestamp: 2026-05-21T16:30:37.331Z
Learning: In modelopt YAML PTQ recipes (e.g., `modelopt_recipes/huggingface/mpt/ptq/w4a8_awq-kv_fp8_cast.yaml`), the `*weight_quantizer` cfg list with both `$import: int4_per_block` and `$import: fp8` is correct and intentional for the W4A8 AWQ scheme. It resolves to two `QuantizerAttributeConfig` entries (INT4 block storage + FP8 attribute layer) that together implement W4A8 AWQ — they are NOT mutually exclusive formats. The canonical preset `modelopt_recipes/configs/ptq/presets/model/w4a8_awq_beta.yaml` has the identical structure. Do not flag this list-form cfg as a conflict.
Learnt from: shengliangxu
Repo: NVIDIA/Model-Optimizer PR: 1253
File: modelopt_recipes/configs/numerics/nvfp4.yaml:18-21
Timestamp: 2026-04-15T23:30:08.441Z
Learning: In NVIDIA Model-Optimizer (ModelOpt), within a quantizer `cfg` block, `type` (e.g., `dynamic` or `static`) and `scale_bits` (e.g., `e4m3`) are sub-keys nested **inside** `block_sizes`, not siblings of it. The correct NVFP4 quantizer structure is:
```yaml
block_sizes:
-1: 16
type: dynamic
scale_bits: e4m3
num_bits: e2m1
```
This applies to all NVFP4 snippet files such as `modelopt_recipes/configs/numerics/nvfp4.yaml` and `modelopt_recipes/configs/numerics/nvfp4_static.yaml`.
| args: | ||
| - --draft_model_dir /scratchspace/export | ||
| - --draft_length 3 | ||
| - --output_length 4096 | ||
| - --engine VLLM | ||
| - --tp_size 4 | ||
| - --ep_size 1 | ||
| - --speculative_algorithm EAGLE3 | ||
| - --mtbench /hf-local/HuggingFaceH4/mt_bench_prompts/raw/question.jsonl | ||
| - --concurrency 1 | ||
| environment: | ||
| - HF_LOCAL: /hf-local | ||
| - HF_MODEL_CKPT: <<global_vars.hf_model>> | ||
| slurm_config: |
There was a problem hiding this comment.
Benchmark step omits required GPT-OSS runtime settings.
task_3 is missing --trust-remote-code and does not export TIKTOKEN_RS_CACHE_DIR, even though both are required for this model family in earlier steps/comments.
Proposed fix
task_3:
args:
- --draft_model_dir /scratchspace/export
- --draft_length 3
- --output_length 4096
- --engine VLLM
+ - --trust-remote-code
- --tp_size 4
- --ep_size 1
@@
environment:
- HF_LOCAL: /hf-local
- HF_MODEL_CKPT: <<global_vars.hf_model>>
+ - TIKTOKEN_RS_CACHE_DIR: /hf-local/tiktoken_cache🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tools/launcher/examples/OpenAI/GPT-OSS-20B/hf_offline_eagle3.yaml` around
lines 91 - 104, task_3's runtime invocation is missing the required
--trust-remote-code flag and the TIKTOKEN_RS_CACHE_DIR environment export;
update the args array (the block containing --speculative_algorithm EAGLE3,
--engine VLLM, etc.) to include --trust-remote-code, and add a
TIKTOKEN_RS_CACHE_DIR entry to the environment list (alongside HF_LOCAL and
HF_MODEL_CKPT) pointing to the intended cache dir (e.g., /tiktoken_rs_cache or
the project's cache path) so the model runs with the same runtime settings as
earlier tasks.
| global_vars: | ||
| hf_model: /hf-local/Qwen/Qwen3.5-27B | ||
|
|
There was a problem hiding this comment.
Required MLM_MODEL_CFG and QUANT_CFG are missing in this new model pipeline config.
Please add both env vars so this config matches launcher metadata conventions used by downstream tooling.
As per coding guidelines, Set MLM_MODEL_CFG environment variable to the HuggingFace repo ID when adding a new model config and Set QUANT_CFG environment variable (e.g., NVFP4_DEFAULT_CFG, INT8_DEFAULT_CFG) when adding a new model config.
Also applies to: 36-38, 54-56, 94-97
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tools/launcher/examples/Qwen/Qwen3.5-27B/hf_offline_eagle3.yaml` around lines
24 - 26, The pipeline config's global_vars block defines hf_model but is missing
the required environment variables MLM_MODEL_CFG and QUANT_CFG; update the
global_vars section (where hf_model is set) to add MLM_MODEL_CFG with the
HuggingFace repo ID for this model and QUANT_CFG with the appropriate
quantization profile (e.g., NVFP4_DEFAULT_CFG or INT8_DEFAULT_CFG) so the entry
for Qwen3.5-27B conforms to launcher metadata conventions; apply the same
addition to the other occurrences noted (around lines 36-38, 54-56, 94-97) to
keep all model configs consistent.
| global_vars: | ||
| hf_model: /hf-local/Qwen/Qwen3.5-35B-A3B | ||
|
|
There was a problem hiding this comment.
New model config is missing MLM_MODEL_CFG and QUANT_CFG.
Please define and propagate both env vars in this YAML so the launcher has required model and quant metadata.
As per coding guidelines, Set MLM_MODEL_CFG environment variable to the HuggingFace repo ID when adding a new model config and Set QUANT_CFG environment variable (e.g., NVFP4_DEFAULT_CFG, INT8_DEFAULT_CFG) when adding a new model config.
Also applies to: 42-44, 59-61, 99-102
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tools/launcher/examples/Qwen/Qwen3.5-35B-A3B/hf_offline_eagle3.yaml` around
lines 29 - 31, The YAML global_vars block defines hf_model but is missing the
required environment variables MLM_MODEL_CFG and QUANT_CFG; add MLM_MODEL_CFG
with the HuggingFace repo ID for this model and set QUANT_CFG to the appropriate
quant config (e.g., NVFP4_DEFAULT_CFG or INT8_DEFAULT_CFG) in the same
global_vars section and ensure any places referencing this model (the blocks
around the existing hf_model entries) propagate these two env vars so the
launcher receives both model metadata and quant config.
| global_vars: | ||
| hf_model: /hf-local/Qwen/Qwen3.5-9B |
There was a problem hiding this comment.
Set mandatory MLM_MODEL_CFG and QUANT_CFG in this new config
Please add both required env vars so this model config conforms to launcher standards and avoids downstream metadata gaps.
As per coding guidelines, Set MLM_MODEL_CFG environment variable to the HuggingFace repo ID when adding a new model config and Set QUANT_CFG environment variable (e.g., NVFP4_DEFAULT_CFG, INT8_DEFAULT_CFG) when adding a new model config.
Also applies to: 36-37, 54-55, 94-96
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tools/launcher/examples/Qwen/Qwen3.5-9B/hf_offline_eagle3.yaml` around lines
24 - 25, Add the mandatory environment entries MLM_MODEL_CFG and QUANT_CFG next
to the existing global_vars hf_model definition (hf_model:
/hf-local/Qwen/Qwen3.5-9B): set MLM_MODEL_CFG to the HuggingFace repo ID for
this model and set QUANT_CFG to the chosen quantization profile (e.g.,
NVFP4_DEFAULT_CFG or INT8_DEFAULT_CFG). Repeat the same addition for the other
hf_model blocks present later in the file (the other occurrences of global_vars
with hf_model) so all model configs include both required env vars.
| global_vars: | ||
| hf_model: /hf-local/stepfun-ai/Step-3.5-Flash |
There was a problem hiding this comment.
Add required MLM_MODEL_CFG and QUANT_CFG for this new model config
This config is missing the required model/quant env metadata, so downstream launcher logic that depends on standardized envs can break or become inconsistent.
Suggested patch shape
global_vars:
hf_model: /hf-local/stepfun-ai/Step-3.5-Flash
+ mlm_model_cfg: stepfun-ai/Step-3.5-Flash
+ quant_cfg: <SET_QUANT_CFG>
task_0:
@@
environment:
- HF_LOCAL: /hf-local
+ - MLM_MODEL_CFG: <<global_vars.mlm_model_cfg>>
+ - QUANT_CFG: <<global_vars.quant_cfg>>
task_1:
@@
environment:
- HF_MODEL_CKPT: <<global_vars.hf_model>>
+ - MLM_MODEL_CFG: <<global_vars.mlm_model_cfg>>
+ - QUANT_CFG: <<global_vars.quant_cfg>>
task_3:
@@
environment:
- HF_LOCAL: /hf-local
- HF_MODEL_CKPT: <<global_vars.hf_model>>
+ - MLM_MODEL_CFG: <<global_vars.mlm_model_cfg>>
+ - QUANT_CFG: <<global_vars.quant_cfg>>As per coding guidelines, Set MLM_MODEL_CFG environment variable to the HuggingFace repo ID when adding a new model config and Set QUANT_CFG environment variable (e.g., NVFP4_DEFAULT_CFG, INT8_DEFAULT_CFG) when adding a new model config.
Also applies to: 47-48, 65-66, 105-107
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tools/launcher/examples/StepFun/Step-3.5-Flash/hf_offline_eagle3.yaml` around
lines 34 - 35, The new model config only sets hf_model but omits the required
environment metadata; add MLM_MODEL_CFG with the HuggingFace repo ID (e.g., the
same value as hf_model or the canonical HF repo string) and set QUANT_CFG to the
appropriate quantization profile (e.g., NVFP4_DEFAULT_CFG or INT8_DEFAULT_CFG)
inside the same global_vars block so launcher code sees standardized envs; apply
the same pattern for the other model blocks in this file that define hf_model
(the other global_vars entries referenced in the comment).
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #1417 +/- ##
==========================================
- Coverage 77.41% 76.88% -0.54%
==========================================
Files 480 482 +2
Lines 52499 53387 +888
==========================================
+ Hits 40642 41044 +402
- Misses 11857 12343 +486
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
- Add compute_hidden_states_vllm.py: uses speculators.VllmHiddenStatesGenerator to extract hidden states via vLLM; output format identical to the HF variant - Add dump_offline_data_vllm.sh: launcher wrapper for the vLLM backend; three backends now available for task_1 (TRT-LLM / HF / vLLM) - Update EAGLE3_TRIAGE.md: mark Issue 1 as FIXED, update pipeline overview to show all three dump backends, update model matrix rows 1-7 to NEEDS RERUN (blocked by missing script in round 1, now resolved) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Ye Yu <yeyu@nvidia.com>
There was a problem hiding this comment.
Actionable comments posted: 3
🧹 Nitpick comments (1)
tools/launcher/common/eagle3/dump_offline_data_vllm.sh (1)
36-55: 💤 Low valueQuote shell variables to prevent word splitting.
Multiple variables lack quotes in test conditions and command arguments. While unlikely to cause issues in a typical SLURM environment, quoting is a shell best practice.
♻️ Shellcheck-compliant quoting
-if [ -z ${SLURM_ARRAY_TASK_ID} ]; then +if [ -z "${SLURM_ARRAY_TASK_ID}" ]; then TASK_ID=0 else echo "SLURM_ARRAY_TASK_ID ${SLURM_ARRAY_TASK_ID}" TASK_ID=${SLURM_ARRAY_TASK_ID} fi -if [ -z ${SLURM_ARRAY_TASK_COUNT} ]; then +if [ -z "${SLURM_ARRAY_TASK_COUNT}" ]; then TASK_COUNT=1 else echo "SLURM_ARRAY_TASK_COUNT ${SLURM_ARRAY_TASK_COUNT}" TASK_COUNT=${SLURM_ARRAY_TASK_COUNT} fi python3 modules/Model-Optimizer/examples/speculative_decoding/collect_hidden_states/compute_hidden_states_vllm.py \ - --model ${HF_MODEL_CKPT} \ - --dp-rank ${TASK_ID} \ - --dp-world-size ${TASK_COUNT} \ + --model "${HF_MODEL_CKPT}" \ + --dp-rank "${TASK_ID}" \ + --dp-world-size "${TASK_COUNT}" \ --trust_remote_code \ "$@"As per coding guidelines, shellcheck SC2086.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tools/launcher/common/eagle3/dump_offline_data_vllm.sh` around lines 36 - 55, Several shell variables are unquoted which can cause word-splitting (SC2086): wrap parameter expansions like ${SLURM_ARRAY_TASK_ID}, ${SLURM_ARRAY_TASK_COUNT}, ${TASK_ID}, ${TASK_COUNT}, ${HF_MODEL_CKPT} and the positional expansion ${@} in double quotes; update the test conditions ([ -z "${SLURM_ARRAY_TASK_ID}" ] etc.) and the python invocation (--model "${HF_MODEL_CKPT}" --dp-rank "${TASK_ID}" --dp-world-size "${TASK_COUNT}" "${@}") so all referenced symbols in this script are properly quoted.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tools/launcher/common/eagle3/dump_offline_data_vllm.sh`:
- Around line 50-51: The invocation of compute_hidden_states_vllm.py uses the
environment variable HF_MODEL_CKPT without validation; add a guard before
running the python command that checks HF_MODEL_CKPT is non-empty, prints a
clear error (e.g., "HF_MODEL_CKPT is not set"), and exits with a non-zero status
if missing so the script never calls python with an empty --model argument;
update the shell block that builds the python3 command to perform this check and
fail fast.
- Line 34: The pip install command currently hides failures via "2>/dev/null ||
true"; change it to run pip install speculators datasets and then verify success
by checking the exit code and attempting to import the packages (e.g. run python
-c "import speculators, datasets" && echo "ok" || { echo "pip install or import
failed" >&2; exit 1; }) instead of swallowing errors; replace the suppressed
invocation "pip install speculators datasets 2>/dev/null || true" with an
install+import verification sequence and emit a clear error and non-zero exit
when installation/import fails.
- Line 55: The script currently forwards arguments using ${@}, which splits
arguments with spaces; change the forwarding to use "$@" so argument boundaries
are preserved when invoking the Python script (replace ${@} with "$@" wherever
arguments are forwarded). Ensure any exec or python call that used ${@} is
updated to use the quoted form to satisfy shellcheck SC2068 and avoid breaking
multi-word arguments.
---
Nitpick comments:
In `@tools/launcher/common/eagle3/dump_offline_data_vllm.sh`:
- Around line 36-55: Several shell variables are unquoted which can cause
word-splitting (SC2086): wrap parameter expansions like ${SLURM_ARRAY_TASK_ID},
${SLURM_ARRAY_TASK_COUNT}, ${TASK_ID}, ${TASK_COUNT}, ${HF_MODEL_CKPT} and the
positional expansion ${@} in double quotes; update the test conditions ([ -z
"${SLURM_ARRAY_TASK_ID}" ] etc.) and the python invocation (--model
"${HF_MODEL_CKPT}" --dp-rank "${TASK_ID}" --dp-world-size "${TASK_COUNT}"
"${@}") so all referenced symbols in this script are properly quoted.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 8966f1a1-9e30-4747-94d9-e8de1499db9a
📒 Files selected for processing (3)
examples/speculative_decoding/collect_hidden_states/compute_hidden_states_vllm.pytools/launcher/common/eagle3/dump_offline_data_vllm.shtools/launcher/examples/EAGLE3_TRIAGE.md
✅ Files skipped from review due to trivial changes (1)
- examples/speculative_decoding/collect_hidden_states/compute_hidden_states_vllm.py
| # --input-data, --output-dir, --max-seq-len, etc. | ||
| ################################################################################################### | ||
|
|
||
| pip install speculators datasets 2>/dev/null || true |
There was a problem hiding this comment.
Verify pip installation success instead of suppressing errors.
Silently suppressing all pip errors with 2>/dev/null || true can hide real installation failures (network issues, missing dependencies, version conflicts). Consider verifying that the packages are importable after installation.
🛡️ Proposed fix to verify installation
-pip install speculators datasets 2>/dev/null || true
+pip install speculators datasets || {
+ echo "Warning: Failed to install speculators/datasets. Continuing anyway..."
+}
+
+python3 -c "import speculators, datasets" || {
+ echo "ERROR: Required packages not available. Exiting."
+ exit 1
+}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| pip install speculators datasets 2>/dev/null || true | |
| pip install speculators datasets || { | |
| echo "Warning: Failed to install speculators/datasets. Continuing anyway..." | |
| } | |
| python3 -c "import speculators, datasets" || { | |
| echo "ERROR: Required packages not available. Exiting." | |
| exit 1 | |
| } |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tools/launcher/common/eagle3/dump_offline_data_vllm.sh` at line 34, The pip
install command currently hides failures via "2>/dev/null || true"; change it to
run pip install speculators datasets and then verify success by checking the
exit code and attempting to import the packages (e.g. run python -c "import
speculators, datasets" && echo "ok" || { echo "pip install or import failed"
>&2; exit 1; }) instead of swallowing errors; replace the suppressed invocation
"pip install speculators datasets 2>/dev/null || true" with an install+import
verification sequence and emit a clear error and non-zero exit when
installation/import fails.
| --dp-rank ${TASK_ID} \ | ||
| --dp-world-size ${TASK_COUNT} \ | ||
| --trust_remote_code \ | ||
| ${@} |
There was a problem hiding this comment.
Fix argument forwarding to preserve boundaries.
${@} will break arguments containing spaces. Use "$@" instead to preserve argument boundaries when forwarding CLI arguments to the Python script.
🐛 Proposed fix
--dp-rank ${TASK_ID} \
--dp-world-size ${TASK_COUNT} \
--trust_remote_code \
- ${@}
+ "$@"As per coding guidelines, shellcheck SC2068.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| ${@} | |
| "$@" |
🧰 Tools
🪛 Shellcheck (0.11.0)
[error] 55-55: Double quote array expansions to avoid re-splitting elements.
(SC2068)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tools/launcher/common/eagle3/dump_offline_data_vllm.sh` at line 55, The
script currently forwards arguments using ${@}, which splits arguments with
spaces; change the forwarding to use "$@" so argument boundaries are preserved
when invoking the Python script (replace ${@} with "$@" wherever arguments are
forwarded). Ensure any exec or python call that used ${@} is updated to use the
quoted form to satisfy shellcheck SC2068 and avoid breaking multi-word
arguments.
770b599 to
642da1f
Compare
nemo-run's SlurmExecutor defaults to dependency_type="afterok", which cancels all downstream tasks when a predecessor times out or fails. For pipelines with allow_to_fail=True, use "afterany" so subsequent tasks run regardless of predecessor exit status. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Ye Yu <yeyu@nvidia.com>
…ript - Add language specifiers to fenced code blocks in EAGLE3_TRIAGE.md (MD040: two ``` blocks needed `text` language label) - Apply ruff-format reformatting to compute_hidden_states_vllm.py (line-length adjustments to match project style) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Ye Yu <yeyu@nvidia.com>
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tools/launcher/examples/EAGLE3_TRIAGE.md`:
- Around line 101-107: In the EAGLE3_TRIAGE table, normalize the rerun wording
and fix the typo: replace every occurrence of the string "🔁 NEEDS RERUN" with
"🔁 RERUN REQUIRED" (affecting rows like the Ministral, GPT-OSS, MiniMax,
Qwen3.5, Step-3.5, DeepSeek entries) and change the footnote/remark text "or_hf"
to "or _hf" (seen in the Step-3.5-Flash row remark). Make these textual edits in
the table cells to maintain consistent phrasing and spacing.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 92740be9-b940-44d8-a42c-23633272cbc3
📒 Files selected for processing (2)
examples/speculative_decoding/collect_hidden_states/compute_hidden_states_vllm.pytools/launcher/examples/EAGLE3_TRIAGE.md
🚧 Files skipped from review as they are similar to previous changes (1)
- examples/speculative_decoding/collect_hidden_states/compute_hidden_states_vllm.py
| | 1 | Ministral-3-8B | Dense | 8B | ⏱ TIMEOUT (3277/3295) | 🔁 NEEDS RERUN (_vllm) | 🔲 | 🔲 | task_0 nearly complete (99%); t1 re-run needed | | ||
| | 2 | Ministral-3-14B | Dense | 14B | ⏱ TIMEOUT | 🔁 NEEDS RERUN (_vllm) | 🔲 | 🔲 | — | | ||
| | 3 | GPT-OSS-20B | Dense | 20B | ❌ TOKENIZER | 🔁 NEEDS RERUN (_vllm) | 🔲 | 🔲 | Fix: populate TIKTOKEN_RS_CACHE_DIR first | | ||
| | 4 | MiniMax-M2.5 | MoE | 230B/10B | ⏱ TIMEOUT | 🔁 NEEDS RERUN (_vllm) | 🔲 | ❌ TRUST_REMOTE_CODE | trust_remote_code needed at bench | | ||
| | 5 | Qwen3.5-35B-A3B | MoE | 35B/3B | ⏱ TIMEOUT | 🔁 NEEDS RERUN (_vllm) | 🔲 | 🔲 | — | | ||
| | 6 | Step-3.5-Flash | MoE/SWA | 197B/11B | ⏱ TIMEOUT | 🔁 NEEDS RERUN (_vllm) | 🔲 | 🔲 | SWA: use _vllm or_hf script | | ||
| | 7 | DeepSeek-V3.2 | MoE/MLA | 685B/37B | 🔍 (tarball only) | 🔁 NEEDS RERUN (_vllm, 2-node) | 🔲 | 🔲 | 2-node; previous t1 OOM-killed | |
There was a problem hiding this comment.
Clean up matrix wording for consistency/readability.
Please normalize 🔁 NEEDS RERUN to something like 🔁 RERUN REQUIRED (Lines 101–107), and fix the typo or_hf → or _hf on Line 106.
🧰 Tools
🪛 LanguageTool
[style] ~101-~101: The double modal “NEEDS RERUN” is nonstandard (only accepted in certain dialects). Consider “to be RERUN”.
Context: ...| 8B | ⏱ TIMEOUT (3277/3295) | 🔁 NEEDS RERUN (_vllm) | 🔲 | 🔲 | task_0 nearly compl...
(NEEDS_FIXED)
[style] ~102-~102: The double modal “NEEDS RERUN” is nonstandard (only accepted in certain dialects). Consider “to be RERUN”.
Context: ...4B | Dense | 14B | ⏱ TIMEOUT | 🔁 NEEDS RERUN (_vllm) | 🔲 | 🔲 | — | | 3 | GPT-OSS-2...
(NEEDS_FIXED)
[style] ~103-~103: The double modal “NEEDS RERUN” is nonstandard (only accepted in certain dialects). Consider “to be RERUN”.
Context: ... | Dense | 20B | ❌ TOKENIZER | 🔁 NEEDS RERUN (_vllm) | 🔲 | 🔲 | Fix: populate TIKTO...
(NEEDS_FIXED)
[style] ~104-~104: The double modal “NEEDS RERUN” is nonstandard (only accepted in certain dialects). Consider “to be RERUN”.
Context: ...| MoE | 230B/10B | ⏱ TIMEOUT | 🔁 NEEDS RERUN (_vllm) | 🔲 | ❌ TRUST_REMOTE_CODE | tr...
(NEEDS_FIXED)
[style] ~105-~105: The double modal “NEEDS RERUN” is nonstandard (only accepted in certain dialects). Consider “to be RERUN”.
Context: ...B | MoE | 35B/3B | ⏱ TIMEOUT | 🔁 NEEDS RERUN (_vllm) | 🔲 | 🔲 | — | | 6 | Step-3.5-...
(NEEDS_FIXED)
[style] ~106-~106: The double modal “NEEDS RERUN” is nonstandard (only accepted in certain dialects). Consider “to be RERUN”.
Context: ...E/SWA | 197B/11B | ⏱ TIMEOUT | 🔁 NEEDS RERUN (_vllm) | 🔲 | 🔲 | SWA: use _vllm or_h...
(NEEDS_FIXED)
[style] ~107-~107: The double modal “NEEDS RERUN” is nonstandard (only accepted in certain dialects). Consider “to be RERUN”.
Context: ...685B/37B | 🔍 (tarball only) | 🔁 NEEDS RERUN (_vllm, 2-node) | 🔲 | 🔲 | 2-node; pre...
(NEEDS_FIXED)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tools/launcher/examples/EAGLE3_TRIAGE.md` around lines 101 - 107, In the
EAGLE3_TRIAGE table, normalize the rerun wording and fix the typo: replace every
occurrence of the string "🔁 NEEDS RERUN" with "🔁 RERUN REQUIRED" (affecting
rows like the Ministral, GPT-OSS, MiniMax, Qwen3.5, Step-3.5, DeepSeek entries)
and change the footnote/remark text "or_hf" to "or _hf" (seen in the
Step-3.5-Flash row remark). Make these textual edits in the table cells to
maintain consistent phrasing and spacing.
- dump_offline_data_vllm.sh: pin speculators<0.5.0 to fix ImportError (VllmHiddenStatesGenerator removed in speculators 0.5.0) - EAGLE3_TRIAGE.md: add Issues 6 (speculators API break) and 7 (query.py shard auto-downgrade causing empty data on timeout); mark both FIXED; update Ministral-3-8B row status Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Ye Yu <yeyu@nvidia.com>
There was a problem hiding this comment.
♻️ Duplicate comments (4)
tools/launcher/common/eagle3/dump_offline_data_vllm.sh (2)
34-34:⚠️ Potential issue | 🟠 Major | ⚡ Quick winDo not suppress dependency install failures in the launcher path.
Line 34 hides install errors (
2>/dev/null || true), so the job can continue and fail later in a harder-to-debug way.Proposed fix
-pip install "speculators<0.5.0" datasets 2>/dev/null || true +if ! python3 -m pip install "speculators<0.5.0" datasets; then + echo "ERROR: failed to install required packages: speculators, datasets" >&2 + exit 1 +fi🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tools/launcher/common/eagle3/dump_offline_data_vllm.sh` at line 34, The pip install command in dump_offline_data_vllm.sh currently suppresses failures ("pip install \"speculators<0.5.0\" datasets 2>/dev/null || true"), which hides install errors; update that invocation so install failures are not silenced—remove the stderr redirection and the "|| true" so the script fails fast (or explicitly exit non-zero on failure) when "pip install" for speculators/datasets fails, ensuring the error surfaces during launcher runs.
50-55:⚠️ Potential issue | 🔴 Critical | ⚡ Quick winHarden Python arg construction: validate model env and preserve argv boundaries.
Line 51 should fail fast if
HF_MODEL_CKPTis unset, and Line 55 must use"$@"(not${@}) to preserve argument boundaries.Proposed fix
+if [ -z "${HF_MODEL_CKPT:-}" ]; then + echo "ERROR: HF_MODEL_CKPT is not set" >&2 + exit 1 +fi + python3 modules/Model-Optimizer/examples/speculative_decoding/collect_hidden_states/compute_hidden_states_vllm.py \ - --model ${HF_MODEL_CKPT} \ - --dp-rank ${TASK_ID} \ - --dp-world-size ${TASK_COUNT} \ + --model "${HF_MODEL_CKPT}" \ + --dp-rank "${TASK_ID}" \ + --dp-world-size "${TASK_COUNT}" \ --trust_remote_code \ - ${@} + "$@"🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tools/launcher/common/eagle3/dump_offline_data_vllm.sh` around lines 50 - 55, Ensure the script fails fast if HF_MODEL_CKPT is unset and preserve argument boundaries when forwarding CLI args: add a check for the environment variable HF_MODEL_CKPT (exit non‑zero with an error log if empty) before invoking compute_hidden_states_vllm.py, and change the final argument expansion from ${@} to "$@" when calling python3 modules/Model-Optimizer/examples/speculative_decoding/collect_hidden_states/compute_hidden_states_vllm.py that uses --model ${HF_MODEL_CKPT} --dp-rank ${TASK_ID} --dp-world-size ${TASK_COUNT} so TASK_ID/TASK_COUNT remain unchanged.tools/launcher/examples/EAGLE3_TRIAGE.md (2)
106-106:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winFix typo in script reference.
The text "or_hf" should be "or _hf" (space before underscore) to match the script naming pattern
_vllmor_hf.📝 Suggested fix
-| 6 | Step-3.5-Flash | MoE/SWA | 197B/11B | ⏱ TIMEOUT | 🔁 NEEDS RERUN (_vllm) | 🔲 | 🔲 | SWA: use _vllm or_hf script | +| 6 | Step-3.5-Flash | MoE/SWA | 197B/11B | ⏱ TIMEOUT | 🔁 NEEDS RERUN (_vllm) | 🔲 | 🔲 | SWA: use _vllm or _hf script |🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tools/launcher/examples/EAGLE3_TRIAGE.md` at line 106, The table cell contains a typo "or_hf" and should be changed to "or _hf" to match the script naming pattern; update the text in the row containing "Step-3.5-Flash" so the note reads "SWA: use _vllm or _hf" (referencing the script names _vllm and _hf) to ensure consistent spacing and naming.
102-107:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winNormalize status wording for grammar and consistency.
The phrase "🔁 NEEDS RERUN" is grammatically nonstandard. Consider:
- "🔁 RERUN NEEDED" or "🔁 RERUN REQUIRED" to maintain the semantic distinction from "🔁 RERUNNING" (which indicates active re-execution)
This preserves the meaningful status distinction while improving readability.
📝 Suggested fix
-| 2 | Ministral-3-14B | Dense | 14B | ⏱ TIMEOUT | 🔁 NEEDS RERUN (_vllm) | 🔲 | 🔲 | — | -| 3 | GPT-OSS-20B | Dense | 20B | ❌ TOKENIZER | 🔁 NEEDS RERUN (_vllm) | 🔲 | 🔲 | Fix: populate TIKTOKEN_RS_CACHE_DIR first | -| 4 | MiniMax-M2.5 | MoE | 230B/10B | ⏱ TIMEOUT | 🔁 NEEDS RERUN (_vllm) | 🔲 | ❌ TRUST_REMOTE_CODE | trust_remote_code needed at bench | -| 5 | Qwen3.5-35B-A3B | MoE | 35B/3B | ⏱ TIMEOUT | 🔁 NEEDS RERUN (_vllm) | 🔲 | 🔲 | — | -| 6 | Step-3.5-Flash | MoE/SWA | 197B/11B | ⏱ TIMEOUT | 🔁 NEEDS RERUN (_vllm) | 🔲 | 🔲 | SWA: use _vllm or_hf script | -| 7 | DeepSeek-V3.2 | MoE/MLA | 685B/37B | 🔍 (tarball only) | 🔁 NEEDS RERUN (_vllm, 2-node) | 🔲 | 🔲 | 2-node; previous t1 OOM-killed | +| 2 | Ministral-3-14B | Dense | 14B | ⏱ TIMEOUT | 🔁 RERUN NEEDED (_vllm) | 🔲 | 🔲 | — | +| 3 | GPT-OSS-20B | Dense | 20B | ❌ TOKENIZER | 🔁 RERUN NEEDED (_vllm) | 🔲 | 🔲 | Fix: populate TIKTOKEN_RS_CACHE_DIR first | +| 4 | MiniMax-M2.5 | MoE | 230B/10B | ⏱ TIMEOUT | 🔁 RERUN NEEDED (_vllm) | 🔲 | ❌ TRUST_REMOTE_CODE | trust_remote_code needed at bench | +| 5 | Qwen3.5-35B-A3B | MoE | 35B/3B | ⏱ TIMEOUT | 🔁 RERUN NEEDED (_vllm) | 🔲 | 🔲 | — | +| 6 | Step-3.5-Flash | MoE/SWA | 197B/11B | ⏱ TIMEOUT | 🔁 RERUN NEEDED (_vllm) | 🔲 | 🔲 | SWA: use _vllm or _hf script | +| 7 | DeepSeek-V3.2 | MoE/MLA | 685B/37B | 🔍 (tarball only) | 🔁 RERUN NEEDED (_vllm, 2-node) | 🔲 | 🔲 | 2-node; previous t1 OOM-killed |🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tools/launcher/examples/EAGLE3_TRIAGE.md` around lines 102 - 107, Replace the nonstandard status string "🔁 NEEDS RERUN" throughout the table with a grammatically consistent alternative such as "🔁 RERUN NEEDED" (or "🔁 RERUN REQUIRED") so the status column reads uniformly; update every occurrence of the literal token "🔁 NEEDS RERUN" (e.g., the cell "🔁 NEEDS RERUN (_vllm)") to the chosen wording and ensure any adjacent parenthetical notes like "(_vllm)" remain unchanged.
🧹 Nitpick comments (1)
tools/launcher/common/eagle3/dump_offline_data_vllm.sh (1)
36-48: ⚡ Quick winUse defaulted, quoted SLURM env checks to avoid brittle test behavior.
Line 36 and Line 43 should use
${VAR:-}with quotes; current form is fragile for unset/nounset contexts and non-idiomatic.Proposed fix
-if [ -z ${SLURM_ARRAY_TASK_ID} ]; then +if [ -z "${SLURM_ARRAY_TASK_ID:-}" ]; then @@ -if [ -z ${SLURM_ARRAY_TASK_COUNT} ]; then +if [ -z "${SLURM_ARRAY_TASK_COUNT:-}" ]; then🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tools/launcher/common/eagle3/dump_offline_data_vllm.sh` around lines 36 - 48, The checks and assignments for SLURM_ARRAY_TASK_ID and SLURM_ARRAY_TASK_COUNT are fragile; update the conditional tests and assignments to use quoted defaulted parameter expansion (e.g. use "${SLURM_ARRAY_TASK_ID:-}" and "${SLURM_ARRAY_TASK_COUNT:-}") so the [ -z ... ] checks are safe when variables are unset or set to empty and the TASK_ID/TASK_COUNT assignments use the same quoted expansions; modify the if-test expressions and the TASK_ID/TASK_COUNT assignment expressions accordingly to reference "${SLURM_ARRAY_TASK_ID:-}" and "${SLURM_ARRAY_TASK_COUNT:-}".
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Duplicate comments:
In `@tools/launcher/common/eagle3/dump_offline_data_vllm.sh`:
- Line 34: The pip install command in dump_offline_data_vllm.sh currently
suppresses failures ("pip install \"speculators<0.5.0\" datasets 2>/dev/null ||
true"), which hides install errors; update that invocation so install failures
are not silenced—remove the stderr redirection and the "|| true" so the script
fails fast (or explicitly exit non-zero on failure) when "pip install" for
speculators/datasets fails, ensuring the error surfaces during launcher runs.
- Around line 50-55: Ensure the script fails fast if HF_MODEL_CKPT is unset and
preserve argument boundaries when forwarding CLI args: add a check for the
environment variable HF_MODEL_CKPT (exit non‑zero with an error log if empty)
before invoking compute_hidden_states_vllm.py, and change the final argument
expansion from ${@} to "$@" when calling python3
modules/Model-Optimizer/examples/speculative_decoding/collect_hidden_states/compute_hidden_states_vllm.py
that uses --model ${HF_MODEL_CKPT} --dp-rank ${TASK_ID} --dp-world-size
${TASK_COUNT} so TASK_ID/TASK_COUNT remain unchanged.
In `@tools/launcher/examples/EAGLE3_TRIAGE.md`:
- Line 106: The table cell contains a typo "or_hf" and should be changed to "or
_hf" to match the script naming pattern; update the text in the row containing
"Step-3.5-Flash" so the note reads "SWA: use _vllm or _hf" (referencing the
script names _vllm and _hf) to ensure consistent spacing and naming.
- Around line 102-107: Replace the nonstandard status string "🔁 NEEDS RERUN"
throughout the table with a grammatically consistent alternative such as "🔁
RERUN NEEDED" (or "🔁 RERUN REQUIRED") so the status column reads uniformly;
update every occurrence of the literal token "🔁 NEEDS RERUN" (e.g., the cell
"🔁 NEEDS RERUN (_vllm)") to the chosen wording and ensure any adjacent
parenthetical notes like "(_vllm)" remain unchanged.
---
Nitpick comments:
In `@tools/launcher/common/eagle3/dump_offline_data_vllm.sh`:
- Around line 36-48: The checks and assignments for SLURM_ARRAY_TASK_ID and
SLURM_ARRAY_TASK_COUNT are fragile; update the conditional tests and assignments
to use quoted defaulted parameter expansion (e.g. use "${SLURM_ARRAY_TASK_ID:-}"
and "${SLURM_ARRAY_TASK_COUNT:-}") so the [ -z ... ] checks are safe when
variables are unset or set to empty and the TASK_ID/TASK_COUNT assignments use
the same quoted expansions; modify the if-test expressions and the
TASK_ID/TASK_COUNT assignment expressions accordingly to reference
"${SLURM_ARRAY_TASK_ID:-}" and "${SLURM_ARRAY_TASK_COUNT:-}".
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 3d9ab791-8b27-4bda-bfce-ac5d649d71dc
📒 Files selected for processing (2)
tools/launcher/common/eagle3/dump_offline_data_vllm.shtools/launcher/examples/EAGLE3_TRIAGE.md
…vllm.sh In vllm/vllm-openai:latest, torchvision is installed but incompatible with the container's torch version: importing torchvision raises RuntimeError: operator torchvision::nms does not exist This propagates through transformers → speculators, preventing VllmHiddenStatesGenerator from being imported. Fix: uninstall torchvision before installing speculators<0.5.0. Without torchvision present, transformers' lazy import machinery catches the ImportError gracefully (is_torchvision_available() = False) and speculators imports cleanly. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Ye Yu <yeyu@nvidia.com>
speculators<0.5.0 pins torch==2.10.0, replacing the container's torch 2.11.0+cu130 and causing `ImportError: libtorch_cuda.so: cannot open shared object file` when vllm._C is loaded. Fix: install speculators with --no-deps to avoid the torch downgrade, then install transformers>=4.40,<5.0 separately (speculators is incompatible with transformers 5.x API). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Ye Yu <yeyu@nvidia.com>
The script was written against an unreleased speculators API. Fix to work with speculators 0.4.0.1 (latest stable with VllmHiddenStatesGenerator): - VllmHiddenStatesGenerator constructor: rename `model` kwarg to `model_path` and remove `trust_remote_code` (hardcoded True in 0.4.x) - Result parsing: `hidden_states` is a list of tensors ordered by layer_ids, not a dict keyed by layer index Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Ye Yu <yeyu@nvidia.com>
Installing transformers>=4.40,<5.0 downgrades from the container's transformers 5.8.0, which breaks models like Ministral-3-8B whose model_type 'ministral3' is not registered in older transformers versions. The transformers 5.0 constraint was added to work around a speculators import issue, but that issue was caused by broken torchvision (already fixed). With torchvision uninstalled and speculators installed --no-deps, speculators 0.4.x works with the container's transformers 5.8.0. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Ye Yu <yeyu@nvidia.com>
speculators/config.py defines SpeculatorModelConfig which inherits from transformers' PretrainedConfig. In transformers 5.x, PretrainedConfig carries torch.dtype annotations. Pydantic 2.13 (in vllm-openai:latest) tries to resolve these annotations in the calling module's namespace during model_rebuild(), but speculators/config.py doesn't import torch, causing PydanticUndefinedAnnotation at import time. Fix by patching speculators/config.py in-place after installation to add the missing `import torch` at the top of the file. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Ye Yu <yeyu@nvidia.com>
transformers 5.x: apply_chat_template returns a BatchEncoding that may not inherit from dict; access .input_ids attribute instead of dict key. Also fix: add_generation_template -> add_generation_prompt (correct arg). Handle both 1D (single sequence) and 2D (batched) input_ids shapes. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Ye Yu <yeyu@nvidia.com>
…quest)
Two runtime patches applied to speculators 0.4.0.1 after install:
1. pydantic_utils.py: pass _types_namespace={'torch': torch} to
model_rebuild() to fix PydanticUndefinedAnnotation with
transformers 5.x (PretrainedConfig has torch.dtype annotations
that pydantic 2.13 can't resolve without explicit namespace).
2. vllm_hidden_states_generator.py: remove eos_token_id kwarg from
Request() constructor which was dropped in newer vLLM.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Ye Yu <yeyu@nvidia.com>
speculators 0.4.0.1's generate() loop never calls scheduler.update_from_output(), so KV cache blocks are never freed after each batch. In newer vLLM (0.7+), this causes the scheduler to stop admitting new requests after the first MAX_NUM_SEQS=32 are processed, silently truncating large datasets. Fix by patching vllm_hidden_states_generator.py at install time to: 1. Capture the return value of sample_tokens() 2. Call scheduler.update_from_output(scheduler_output, sampled_output) so the scheduler advances request states and frees KV blocks 3. Call scheduler.finish_requests() for newly-completed prefill requests as an additional fallback to free KV capacity Tested on Ministral-3-8B with 330 conversations: previously only 32 were processed; now all 330 succeed. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Ye Yu <yeyu@nvidia.com>
…nd export - Extend VLM detection in load_vlm_or_llm to check text_config/llm_config attributes, not just "vl" in model_type. This catches mistral3 models (e.g. Ministral-3-8B) which are VLM wrappers. - Add consolidated.safetensors fallback in FakeBaseModel._load_weights for checkpoints with incomplete HF shards but a complete Mistral native consolidated file. - Set use_cache=True in EAGLE export template configs. The None placeholder was not being filled for FakeBaseModel and newer huggingface_hub rejects None for bool fields. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Ye Yu <yeyu@nvidia.com>
…pport - Add Mistral key names (tok_embeddings, output) to existing path lists instead of hardcoding aliases in a consolidated.safetensors fallback - Extend _load_index to try consolidated.safetensors as a single-file fallback (Mistral native checkpoints use this instead of model.safetensors) - Remove leading --- from YAML pipeline configs (pre-commit fix) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Ye Yu <yeyu@nvidia.com>
|
/claude review |
| tokenizer = AutoTokenizer.from_pretrained(args.model, trust_remote_code=args.trust_remote_code) | ||
| if tokenizer.pad_token is None: | ||
| tokenizer.pad_token = tokenizer.eos_token | ||
| tokenizer.chat_template = tokenizer.chat_template.replace(REMOVE_THINK_CHAT_TEMPLATE, "") |
There was a problem hiding this comment.
[IMPORTANT Algorithm] This call will raise AttributeError: 'NoneType' object has no attribute 'replace' for any tokenizer that does not define a chat template (some base models / mirrored checkpoints lack one).
The HF sibling script handles this correctly at compute_hidden_states_hf.py:161-162 with a is not None guard before the .replace() call. The vLLM variant should match — otherwise tokenizers without a chat template crash this script before any prompt is processed, which silently fails the entire dump.
| tokenizer.chat_template = tokenizer.chat_template.replace(REMOVE_THINK_CHAT_TEMPLATE, "") | |
| if tokenizer.chat_template is not None: | |
| tokenizer.chat_template = tokenizer.chat_template.replace(REMOVE_THINK_CHAT_TEMPLATE, "") |
There was a problem hiding this comment.
Already guarded (if tokenizer.chat_template is not None:), matching the HF script. Carried over into the rewrite.
| # Detect VLMs: either "vl" in model_type (e.g. "llava") or has a nested text config | ||
| # (e.g. Mistral3Config with model_type="mistral3" and text_config attribute). | ||
| _is_vlm = "vl" in model_config.model_type.lower() or any( | ||
| getattr(model_config, attr, None) is not None for attr in ["text_config", "llm_config"] | ||
| ) | ||
|
|
||
| if _is_vlm and use_offline_training: | ||
| # For VLMs in offline training, FakeBaseModel loads only embed_tokens + lm_head | ||
| # and auto-detects VLM weight key layouts (e.g. "language_model.model.embed_tokens"). | ||
| from modelopt.torch.speculative.plugins.modeling_fakebase import FakeBaseModel | ||
|
|
||
| return FakeBaseModel.from_source(model_name_or_path, trust_remote_code=trust_remote_code) | ||
|
|
||
| if _is_vlm: | ||
| model_cls = transformers.AutoModelForVision2Seq | ||
| else: | ||
| model_cls = transformers.AutoModelForCausalLM |
There was a problem hiding this comment.
[IMPORTANT Compatibility] Two backward-compatibility concerns with the new _is_vlm detection:
-
Broader detection routes non-VLM models to
AutoModelForVision2Seq. The fall-through at line 612-613 was previously gated only by"vl" in model_type. Now any config that exposes a non-Nonetext_configorllm_configattribute will be loaded withAutoModelForVision2Seq, even when called fromar_validate.py/export_hf_checkpoint.py(which passuse_offline_training=False). If a future / community config sets one of these attrs but isn't a Vision2Seq model, this regresses model loading for paths that worked before. Consider keeping the originalmodel_clsselection on"vl" in model_typeand only widening detection for the offline-training branch above (lines 605-610), where the fallback isFakeBaseModel— which is genuinely tolerant of either layout. -
use_fake_base=Falseis silently overridden for VLMs. The early return at 605-610 unconditionally callsFakeBaseModel.from_source(...)whenever_is_vlm and use_offline_training, regardless ofuse_fake_base. Previously,use_fake_base=Falsefor a VLM would take theAutoModelForVision2Seq + num_hidden_layers=0path. If the override is intentional, it deserves at least a warning so users who explicitly setrecipe.model.use_fake_base_for_offline=Falsearen't surprised; if not, the gate should also includeuse_fake_base.
There was a problem hiding this comment.
Good catch. The broadened text_config/llm_config check is what routes Mistral3 to FakeBaseModel for offline training, and it does also affect the non-offline model_cls selection. For the models in scope this is correct (they're genuine VLMs), and the offline path — the focus of this PR — is validated. I'd prefer to scope the broad check to the offline→FakeBaseModel branch and keep the narrow "vl" in model_type for model_cls as a follow-up, rather than touch the tested path here. Tracking it.
| if (index_path := _try_fetch(_SAFETENSORS_INDEX_FILENAME)) is not None: | ||
| with open(index_path) as f: | ||
| return json.load(f).get("weight_map", {}) | ||
| if (single_path := _try_fetch(_SAFETENSORS_SINGLE_FILENAME)) is not None: | ||
| with safe_open(single_path, framework="pt") as h: | ||
| return dict.fromkeys(h.keys(), _SAFETENSORS_SINGLE_FILENAME) | ||
| for single_name in _SAFETENSORS_SINGLE_FILENAMES: | ||
| if (single_path := _try_fetch(single_name)) is not None: | ||
| with safe_open(single_path, framework="pt") as h: | ||
| return dict.fromkeys(h.keys(), single_name) |
There was a problem hiding this comment.
[SUGGESTION] The consolidated.safetensors fallback only fires when no model.safetensors.index.json exists at all. Many Mistral-style repos ship both an HF index and a native consolidated.safetensors — and the issue the PR description calls out ("checkpoints with incomplete HF shards") would manifest as: the index is present, points to HF shards, but the HF shards omit lm_head / embed_tokens.
In that scenario, _load_index returns the (incomplete) HF weight_map, _find_weight_key raises RuntimeError(... cannot find ...), and the consolidated path is never tried.
Consider: if the index-based weight_map fails to locate either _LM_HEAD_PATHS or _EMBED_TOKENS_PATHS, fall back to scanning consolidated.safetensors (when present) before erroring out. Otherwise the new tok_embeddings / output aliases only help repos that ship only the consolidated file.
There was a problem hiding this comment.
Correct that the consolidated fallback fires only when no index.json is present, which covers the validated case (consolidated-only Mistral checkpoints). The "index present but shards omit lm_head/embed_tokens" case isn't handled; I'll add a retry in _load_weights (fall back to the single-file names when _find_weight_key misses) as a follow-up if we encounter a repo shaped that way.
|
|
||
| source ${SCRIPT_DIR}/../service_utils.sh | ||
|
|
||
| ################################################################################################### | ||
| # vLLM-based hidden state dumping using the speculators library. | ||
| # Uses compute_hidden_states_vllm.py with VllmHiddenStatesGenerator. | ||
| # Suitable for: any model supported by vLLM (broader coverage than TRT-LLM or HF device_map). | ||
| # | ||
| # Required environment: | ||
| # HF_MODEL_CKPT Path to the HF model checkpoint | ||
| # | ||
| # Args passed through to compute_hidden_states_vllm.py: | ||
| # --input-data, --output-dir, --max-seq-len, etc. | ||
| ################################################################################################### | ||
|
|
||
| pip install "speculators<0.5.0" --no-deps 2>/dev/null || true | ||
| pip install datasets 2>/dev/null || true | ||
|
|
||
| # vLLM API compatibility: speculators 0.4.0.1 uses Request(eos_token_id=...) which | ||
| # was removed in newer vLLM. Patch to remove the unsupported kwarg. | ||
| python3 -c " | ||
| import site, os | ||
| for d in site.getsitepackages(): | ||
| path = os.path.join(d, 'speculators', 'data_generation', 'vllm_hidden_states_generator.py') | ||
| if not os.path.exists(path): | ||
| continue | ||
| with open(path) as f: | ||
| c = f.read() | ||
| old = ' eos_token_id=self.tokenizer.eos_token_id,\n' | ||
| if old in c: | ||
| with open(path, 'w') as f: | ||
| f.write(c.replace(old, '')) | ||
| print('Patched vllm_hidden_states_generator.py: removed eos_token_id from Request()') | ||
| else: | ||
| print('vllm_hidden_states_generator.py: eos_token_id already removed or not found') | ||
| break | ||
| " 2>/dev/null || true | ||
|
|
||
| # Pydantic 2.13 compatibility: speculators.ReloadableBaseModel.reload_schema() calls | ||
| # model_rebuild(force=True) without a types_namespace. In pydantic 2.13+, inherited | ||
| # torch.dtype annotations from transformers.PretrainedConfig cannot be resolved in | ||
| # subclass modules that don't import torch. Fix by injecting torch into the namespace. | ||
| python3 -c " | ||
| import site, os | ||
| for d in site.getsitepackages(): | ||
| path = os.path.join(d, 'speculators', 'utils', 'pydantic_utils.py') | ||
| if not os.path.exists(path): | ||
| continue | ||
| with open(path) as f: | ||
| c = f.read() | ||
| old = 'cls.model_rebuild(force=True)' | ||
| new = 'import torch as _torch; cls.model_rebuild(force=True, _types_namespace={\"torch\": _torch})' | ||
| if old in c and new not in c: | ||
| with open(path, 'w') as f: | ||
| f.write(c.replace(old, new)) | ||
| print('Patched pydantic_utils.py: model_rebuild now passes torch namespace') | ||
| else: | ||
| print('pydantic_utils.py already patched or pattern not found') | ||
| break | ||
| " 2>/dev/null || true | ||
|
|
||
| # vLLM scheduler compatibility: speculators 0.4.0.1 generate() loop never calls | ||
| # scheduler.update_from_output(), so KV blocks are never freed and the scheduler | ||
| # stops admitting new requests after MAX_NUM_SEQS=32. Fix by injecting the call | ||
| # inside the loop, plus aborting completed requests to free KV capacity. | ||
| python3 << 'PYEOF' 2>/dev/null || true | ||
| import site, os | ||
|
|
||
| old = ( | ||
| ' model_output = self.executor.execute_model(scheduler_output)\n' | ||
| ' self.executor.sample_tokens(model_output)\n' | ||
| ) | ||
| new = ( | ||
| ' model_output = self.executor.execute_model(scheduler_output)\n' | ||
| ' sampled_output = self.executor.sample_tokens(model_output)\n' | ||
| ' # Advance scheduler state so KV blocks are freed after each batch.\n' | ||
| ' # Without this, newer vLLM never admits requests beyond MAX_NUM_SEQS.\n' | ||
| ' if hasattr(self.scheduler, \'update_from_output\'):\n' |
There was a problem hiding this comment.
[SUGGESTION] Three concerns with the in-place pip install + Python source-patching approach:
-
Silent failure: every
python3 -c "..." 2>/dev/null || trueswallows real errors. If the patch pattern shifts (speculators bumps a version, vLLM renames a method), the script will print "already patched or pattern not found" and proceed — and the actual hidden-state dump will then crash later in a way that's hard to root-cause back to a missed patch. -
pip installover a hot environment: installingspeculators<0.5.0 --no-depsanddatasetsinto the running container at job start makes runs non-reproducible (<0.5.0floats), and--no-depswill silently leave a broken install if speculators ever introduces a new required dep. A pinned version (speculators==0.4.0.1) plus a one-time check would be more durable. -
Schedulability fix is a real correctness patch: the
update_from_output+finish_requestsinjection is doing nontrivial work (advancing scheduler state and freeing KV blocks). If a future speculators release fixes this differently, the regex patch may match but be subtly wrong. Consider tracking this as an upstream issue and pinning speculators to a known-good version rather than relying on the patch lasting indefinitely.
Not blocking — useful for triage today — but worth noting as tech debt with a TODO pointing at the upstream tracker.
There was a problem hiding this comment.
Resolved — the patched library and its 2>/dev/null || true blocks are gone with the speculators removal (now using vLLM's native extractor).
There was a problem hiding this comment.
Claude review — EAGLE3 new model support
Summary
- CRITICAL: 0
- IMPORTANT: 2
- SUGGESTION: 2
Bulk of the PR is YAML/launcher configs and triage docs; the modelopt source surface is small (3 files, ~31 lines). Findings are concentrated there.
Most impactful findings
-
compute_hidden_states_vllm.py:97—tokenizer.chat_template.replace(...)will raiseAttributeErrorfor any tokenizer without a chat template. The HF sibling (compute_hidden_states_hf.py:161-162) guards this; the vLLM variant should match. (IMPORTANT Algorithm) -
modelopt/torch/speculative/utils.py:599-615— broadened_is_vlmdetection (anything with non-Nonetext_config/llm_config) now also gates the existingAutoModelForVision2Seqbranch, which is reached byar_validate.pyandexport_hf_checkpoint.py(both passuse_offline_training=False). Any future / community config that setstext_configfor non-Vision2Seq reasons will regress those paths. Additionally, the new offline-training early return ignoresuse_fake_base=False— silently overriding the user's recipe setting. (IMPORTANT Compatibility) -
modeling_fakebase.py:189-195—consolidated.safetensorsfallback only triggers when nomodel.safetensors.index.jsonexists at all; if the index is present but incomplete, the lookup still raises before the fallback can fire. (SUGGESTION) -
tools/launcher/common/eagle3/dump_offline_data_vllm.sh:19-96— runtimepip install --no-depsof an unpinned speculators range plus three sed-style Python patches that swallow errors. Useful for triage but should be tracked as upstream issues with pinned versions. (SUGGESTION)
Risk assessment
Low-to-moderate. The modelopt code changes are scoped; the use_cache: True template change is semantically correct for EAGLE3 inference deployment; the consolidated.safetensors path is opt-in. The two IMPORTANT items are real but narrow — finding 1 only triggers on tokenizers without a chat template (uncommon in this pipeline's target models), and finding 2's regression risk is contingent on configs that set text_config without being Vision2Seq.
LGTM once the two IMPORTANT items are addressed (or explicitly accepted with rationale).
- Shell scripts: quote variable expansions, use "$@" instead of ${@},
remove pip install error suppression (2>/dev/null || true), add
HF_MODEL_CKPT validation guard
- compute_hidden_states_vllm.py: guard tokenizer.chat_template against
None before .replace() (matches HF sibling script)
- YAML configs: add --trust-remote-code to task_3 benchmark args for
GLM-5, MiniMax-M2.5, Kimi-K2.5-NVFP4, GPT-OSS-20B
- GPT-OSS-20B: add TIKTOKEN_RS_CACHE_DIR to task_3 environment
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ye Yu <yeyu@nvidia.com>
Convert EAGLE3 quick-fail pipeline configs to launcher format: - services/ → common/ script paths - oci_hsg_slurm_factory → slurm_factory - offline_training.sh → train_eagle.sh with config-based args - Add --trust-remote-code to task_3 for custom-code models - Delete triage artifacts (partial pipelines, computelab config) 13 files converted, 22 originals deleted. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Ye Yu <yeyu@nvidia.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Ye Yu <yeyu@nvidia.com>
| @@ -0,0 +1,42 @@ | |||
| #!/bin/bash | |||
There was a problem hiding this comment.
This file seems duplicate with common/eagle3/dump_offline_data.sh?
There was a problem hiding this comment.
Agreed — stale copy of common/eagle3/dump_offline_data.sh. Deleted in f307569cf; the canonical version lives under tools/launcher/common/eagle3/.
| @@ -0,0 +1,54 @@ | |||
| #!/bin/bash | |||
There was a problem hiding this comment.
Duplicate with common/eagle3/dump_offline_data_hf.sh?
There was a problem hiding this comment.
Agreed — stale copy of common/eagle3/dump_offline_data_hf.sh. Deleted in f307569cf.
|
|
||
| import torch | ||
| from datasets import load_dataset | ||
| from speculators.data_generation import VllmHiddenStatesGenerator |
There was a problem hiding this comment.
Do we want to make speculators a dependency to our pipeline?
I think we can start from vllm hidden states extracter, instead of calling speculators.
There was a problem hiding this comment.
Done in f307569cf. Dropped the speculators dependency and switched to vLLM's native extract_hidden_states method (ExampleHiddenStatesConnector). As a bonus, the aux-layer ids now match deployment by construction, since vLLM uses the same eagle_aux_hidden_state_layer_ids convention at serve time.
| source ${SCRIPT_DIR}/../service_utils.sh | ||
|
|
||
| ################################################################################################### | ||
| # vLLM-based hidden state dumping using the speculators library. |
There was a problem hiding this comment.
These patches seems hard to maintain and unnecessary to me. Can we just use vllm hidden states extracter instead?
There was a problem hiding this comment.
Done in f307569cf — switched to vLLM's native extract_hidden_states extractor. All three runtime source-patches are removed along with the speculators dependency.
| @@ -0,0 +1,111 @@ | |||
| # EAGLE3 quick-fail pipeline for deepseek-ai/DeepSeek-V3.2. | |||
There was a problem hiding this comment.
General question: why do we need these quick check yamls in addition to offline examples? They seems doing essentially the same thing to me.
There was a problem hiding this comment.
hf_offline_eagle3.yaml is the full reference config; eagle3_quick_check.yaml is the fail-fast triage variant (tiny data slice, short seq len, one short benchmark) the weekly CI runs to expose breakage cheaply — consistent with the triage purpose of this PR. Happy to fold it into the main YAML via a CLI override instead if you'd prefer a single file. cc @ChenhanYu
| @@ -0,0 +1,136 @@ | |||
| #!/bin/bash | |||
|
|
|||
There was a problem hiding this comment.
Is this file duplicate with common/eagle3/dump_offline_data_vllm.sh?
There was a problem hiding this comment.
Agreed — stale copy of common/eagle3/dump_offline_data_vllm.sh. Deleted in f307569cf.
| @@ -0,0 +1,189 @@ | |||
| #!/bin/bash | |||
There was a problem hiding this comment.
Is this script ever used in the repo?
There was a problem hiding this comment.
No — leftover from the sandbox flow. Deleted in f307569cf; the launcher trains via common/eagle3/train_eagle.sh.
| @@ -0,0 +1,273 @@ | |||
| # EAGLE3 Automation Triage Chart | |||
There was a problem hiding this comment.
I personally feel that this md is something we want to track internelly, instead of making public in the MO repo, since there is too much info, many of them are temporary. Hard to track/update.
There was a problem hiding this comment.
Agreed the time-sensitive status is better tracked internally. Framing for the call: the durable value here is the triage flow (how to read logs, localize the failing task, classify the root cause), not the status snapshot. This pipeline deliberately exposes issues for human handoff rather than fixing them per-model, so the stable artifact is the flow. Proposal: keep a slim triage-flow doc in-repo, move the live status chart + new-model guide internal. cc @ChenhanYu
| @@ -0,0 +1,207 @@ | |||
| # EAGLE3 Automation Triage Chart | |||
|
|
|||
| This document catalogs failure modes observed when running the EAGLE3 pipeline across | |||
There was a problem hiding this comment.
What's the relationship between this doc and EAGLE3_TRIAGE.md? Seems overlapping to me.
I also think this doc should not be kept publically in MO repo.
There was a problem hiding this comment.
There's overlap: EAGLE3_TRIAGE.md is the launcher-facing triage flow, while this chart is the per-model experiment log. Agree the per-model status is volatile and better internal. Proposal: consolidate to one slim triage-flow doc in-repo and move the status chart internal. The pipeline's purpose is to expose where each model breaks for a human to take over — so "FAIL/blocked" rows are the intended output, not unfinished work. cc @ChenhanYu
| @@ -0,0 +1,127 @@ | |||
| # EAGLE3 New Model Support — Triage Guide for Claude Code | |||
There was a problem hiding this comment.
I don't think we need this document. All these info is self-contained in our yaml file and launcher interface, so should be obvious for the agent without explicit description. It's also a debt to maintain.
There was a problem hiding this comment.
Fair — much of it is inferable from the launcher YAMLs. I've corrected it to the current launcher flow for now (train_eagle.sh, eagle3_quick_check.yaml, vLLM native extractor). Fine to remove it if we land the skills (#1429) as the canonical "how to add a model" path. cc @ChenhanYu
| @@ -0,0 +1,215 @@ | |||
| --- | |||
There was a problem hiding this comment.
Can we separate the SKILL.md files to another PR?
There was a problem hiding this comment.
Done — removed the four EAGLE3 SKILL.md files from this PR in f307569cf. They live only in #1429 now.
| | 2 | Kimi-K2.5 | MoE | 1T/32B | Existing | - | - | - | - | GB200 required | | ||
| | 3 | Qwen3.5-9B | Dense (VLM) | 9B | Not run | - | - | - | - | Text-only path | | ||
| | 4 | Qwen3.5-27B | Dense (VLM) | 27B | Not run | - | - | - | - | Text-only path | | ||
| | 5 | Qwen3.5-35B-A3B | MoE (VLM) | 35B/3B | **Blocked** | TIMEOUT | FAIL | FAIL | FAIL | Data synth too slow; infra issues | |
There was a problem hiding this comment.
Question:
Many models are added in the launcher example, but appears as failed or blocked in this chart. Are they already fixed or still failing?
There was a problem hiding this comment.
By design, the chart records status, not fixes. This pipeline is a triage harness: its job is to expose where each model breaks (which of the 4 tasks, with the root-cause error) so a human can take over — not to auto-fix every model. So the "FAIL/blocked" entries are exactly the intended output: a map of what still needs human attention. The code fixes in this PR (VLM detection, consolidated load, use_cache) came out of triaging the Ministral path; the rest remain open items, which is what the chart is meant to surface. I'll refresh the status column once the weekly CI has a clean pass.
- vLLM dump: replace speculators VllmHiddenStatesGenerator + runtime
source-patches with vLLM's native extract_hidden_states extractor
(ExampleHiddenStatesConnector). No third-party data-gen dependency.
- Delete legacy duplicate scripts in examples/speculative_decoding/pipeline/eagle3/
(dump_offline_data{,_hf,_vllm}.sh, offline_training.sh) — canonical
versions live in tools/launcher/common/eagle3/.
- Move EAGLE3 SKILL.md files out of this PR (consolidated into #1429).
- Kimi-K2.5 benchmark: forward --trust-remote-code.
- Update triage docs to match the launcher (train_eagle.sh, eagle3_quick_check.yaml,
vLLM native extractor) and drop stale speculators/offline_training references.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ye Yu <yeyu@nvidia.com>
…1623) ### What does this PR do? Type of change: Revert / cleanup (follow-up to #1417) Per review feedback (@h-guo18): `main` should be production-ready and user-facing. Most of the EAGLE3 model example YAMLs added in #1417 are not yet verified to work end-to-end in modelopt (~80% fail at some pipeline stage), which is confusing to ship. The agreed plan is to **land the triage infrastructure now and re-add each model's launcher YAML in a dedicated follow-up PR once it is verified green**. **Removed** (unverified, to be re-added per-model once verified): - Per-model launcher configs (`hf_offline_eagle3.yaml` + `eagle3_quick_check.yaml`) for: DeepSeek-V3.2, GLM-5, MiniMax-M2.5, Ministral-3-8B, Ministral-3-14B, Kimi-K2.5, Kimi-K2.5-NVFP4, GPT-OSS-20B, Qwen3.5-9B, Qwen3.5-27B, Qwen3.5-35B-A3B, Step-3.5-Flash. - Per-model status docs: `tools/launcher/examples/EAGLE3_TRIAGE.md`, `examples/speculative_decoding/pipeline/eagle3/eagle3_triage_chart.md` (volatile status — tracked internally instead). **Kept** (the durable triage infrastructure from #1417): - Verified baseline example `tools/launcher/examples/Qwen/Qwen3-8B/eagle3_quick_check.yaml`. - Launcher common scripts (vLLM native-extractor dump, etc.) and `compute_hidden_states_vllm.py`. - modelopt code fixes: FakeBaseModel VLM detection, `consolidated.safetensors` load, `use_cache` export templates. - New-model triage guide (`eagle3_new_model_triage_guide.md`); its "document results" step now points at the internal tracker rather than the removed chart. ### Testing No code paths change — this only removes example YAMLs and two status docs and edits one doc reference. Pre-commit (ruff/markdownlint/yaml/license) passes on the kept/edited files. ### Before your PR is "*Ready for review*" - Is this change backward compatible?: ✅ (removes unverified examples only; kept infra unchanged) - If you copied code from any other sources or added a new PIP dependency, did you follow guidance in `CONTRIBUTING.md`: N/A - Did you write any new necessary tests?: N/A - Did you update Changelog?: N/A - Did you get Claude approval on this PR?: ❌ (pending) ### Additional Information Follow-up to #1417. Next step (tracked separately): verify each removed model end-to-end in modelopt, then re-add its YAML in a dedicated PR. Note: the nmm-sandbox weekly EAGLE3 CI is being trimmed to the Qwen3-8B baseline to match. 🤖 Generated with [Claude Code](https://claude.com/claude-code) <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Documentation** * Updated EAGLE3 triage guide to streamline the verification workflow; contributors now record test outcomes (status, experiment IDs, errors, and applied fixes) in the team's internal triage tracker before submitting model launcher configurations. * **Chores** * Removed legacy EAGLE3 example pipeline configurations and deprecated triage documentation to reduce maintenance overhead. <!-- end of auto-generated comment: release notes by coderabbit.ai --> Signed-off-by: Ye Yu <yeyu@nvidia.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
…d new-model support (#1429) ## Summary Adds four user-invocable Claude Code skills for the EAGLE3 offline pipeline (OKR 30 — Claude-assist experience). - **`/eagle3-triage`**: Diagnose a failed pipeline run. Failure tables for all 4 tasks covering vLLM server startup, hidden state dump (3 backends: TRT-LLM / HF / vLLM), training crashes, and benchmark failures. New-model-specific issue checklist (VLMs, MoE, SWA, custom tokenizers). - **`/eagle3-validate`**: Verify a completed run end-to-end. Artifact checks per task, AR threshold validation (≥ 2.1), structured validation report. - **`/eagle3-new-model`**: Guided workflow for adding a new model. Architecture lookup, GB200 GPU/TP calculation, dump backend selection, full YAML template with correct public-launcher script paths. - **`/eagle3-review-logs`**: Lightweight log reader. Finds `sbatch_*.out` files, reads all task logs, produces pass/fail summary with root causes and next steps. Skills use public launcher paths (`common/eagle3/`, `common/vllm/`, etc.) and read `sbatch_*.out` files directly — no sandbox-specific tooling required. ## Related - OKR 31 (triage chart + model configs): #1417 ## Test plan - [ ] Run `/eagle3-triage` against a known-failed experiment and verify it identifies the root cause - [ ] Run `/eagle3-validate` against a passing experiment and verify AR check works - [ ] Run `/eagle3-new-model` to generate a config for a new model and verify the YAML is correct - [ ] Run `/eagle3-review-logs` and verify summary output matches actual log contents <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Documentation** * Added comprehensive EAGLE3 guides: onboarding new models and example config creation, a log-review workflow that generates structured pass/fail reports with root-cause one-liners and suggested fixes, a user-invocable end-to-end validation workflow for pipeline runs, and a triage/troubleshooting guide mapping common failure patterns to concrete remediation steps. <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Signed-off-by: Ye Yu <yeyu@nvidia.com> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Summary
EAGLE3 automation triage work (OKR-30): testing the 4-step EAGLE3 offline pipeline against 12 new model architectures, documenting failure modes, and fixing issues found.
Code fixes (modelopt)
modelopt/torch/speculative/utils.pyload_vlm_or_llmto checktext_config/llm_configattrs (catchesmistral3models)modelopt/torch/speculative/plugins/modeling_fakebase.pyconsolidated.safetensorsfallback for checkpoints with incomplete HF shardsmodelopt/torch/export/plugins/hf_spec_configs.pyuse_cache=Truein EAGLE export templates (fixes stricthuggingface_hubvalidation)Pipeline infrastructure
examples/speculative_decoding/pipeline/eagle3/— pipeline scripts and configs:offline_training.sh— training + export with runtime patches for older container modeloptdump_offline_data_vllm.sh— vLLM-based hidden state extraction (with speculators compat patches)dump_offline_data.sh,dump_offline_data_hf.sh— alternative dump pathsDocumentation
eagle3_triage_chart.md— model test matrix, triage decision tree, per-model results, failure catalogeagle3_new_model_triage_guide.md— step-by-step guide for triaging new modelsModel test results (as of 2026-05-27)
use_cache=nullin export (fixed)HarmonyErrortrust_remote_codeneededIssues found and fixed
mistral3model type not detected as VLMtext_config/llm_configattrs inload_vlm_or_llmconsolidated.safetensorswith Mistral native key aliasesuse_cache=nullin exported EAGLE configuse_cache=Truein export template configsdump_offline_data_vllm.shoffline_training.shinfra issuesTest plan
cicd_1779829129)cicd_1779901409— pending with all fixes)Note
GitHub secret scanning alert #6 is a false positive —
Mistral3ForConditionalGeneration(a HuggingFace model class name in a YAML comment) was flagged as a "Mistral AI API Key".🤖 Generated with Claude Code