
Commit 770b599

yeyu-nvidia and claude committed
feat(okr30): add EAGLE3 Claude Code skills for triage, validation, and new-model support
Four user-invocable skills for the EAGLE3 offline pipeline:

- eagle3-triage: diagnose failed pipeline runs step-by-step; failure tables for all 4 tasks (vLLM data synthesis, hidden state dump with 3 backends, training, benchmark); new-model-specific issue checklist
- eagle3-validate: verify completed runs; artifact checks; AR threshold (>= 2.1); structured validation report with next-step guidance
- eagle3-new-model: guided workflow for adding a new model; architecture lookup, GPU/TP calculation for GB200, backend selection, full YAML template with correct public-launcher script paths
- eagle3-review-logs: lightweight log reader; finds sbatch .out files, reads all task logs, produces pass/fail summary with root causes

Skills use public launcher paths (common/eagle3/, common/vllm/, etc.) and read sbatch .out files directly — no sandbox-specific tooling required.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Ye Yu <yeyu@nvidia.com>
1 parent 642da1f commit 770b599

4 files changed — 609 additions & 0 deletions

Lines changed: 215 additions & 0 deletions
---
name: eagle3-new-model
description: >
  Add a new model to the EAGLE3 offline pipeline. Generates an hf_offline_eagle3.yaml
  launcher config for a new model checkpoint, choosing the right hidden state dump
  backend (TRT-LLM / HF / vLLM) and GPU configuration.
  Use when user wants to run EAGLE3 on a model that does not yet have a YAML in
  tools/launcher/examples/ or asks how to configure the pipeline for a new checkpoint.
---

# EAGLE3 New Model Configuration

This skill guides you through creating `tools/launcher/examples/<Org>/<Model>/hf_offline_eagle3.yaml`
for a new model.

## Step 1 — Look up the model architecture

Determine these values from the HuggingFace model card, `config.json`, and vLLM docs:

| Property | Where to find it |
|---|---|
| Total / active parameters | Model card |
| Dense or MoE? | `config.json`: `num_experts`, `num_experts_per_tok` |
| Attention type (MHA / GQA / MLA / SWA) | Model card |
| Multimodal? (vision encoder) | Model card |
| BF16 weight size (GB) | `total_params × 2 bytes` |
| Special serving flags | vLLM docs, model README (`--trust-remote-code`, parsers) |

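The dense-vs-MoE check can be scripted against a parsed `config.json`. A minimal Python sketch, assuming the common `num_experts` / `num_local_experts` / `num_experts_per_tok` key names (architectures vary, so treat the field list as an assumption and fall back to the model card):

```python
import json

def classify_model(config: dict) -> dict:
    """Classify a parsed HuggingFace config.json as dense or MoE.

    Assumes common MoE key names; some architectures use other fields.
    """
    experts = config.get("num_experts") or config.get("num_local_experts")
    return {
        "kind": "MoE" if experts else "dense",
        "num_experts": experts,
        "active_per_tok": config.get("num_experts_per_tok"),
    }

# usage: classify_model(json.load(open("config.json")))
```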
## Step 2 — Calculate GPU requirements (OCI-HSG / GB200)

OCI-HSG nodes: **4 GPUs × 192 GB HBM3e = 768 GB per node**

```
BF16 weight size = total_params × 2 bytes
GPUs needed = ceil(weight_size_GB / 192)
nodes = ceil(gpus_needed / 4)
tp = min(gpus_needed, 4)
```

| Model | Weights (BF16) | GPUs | nodes | tp |
|---|---|---|---|---|
| 8B dense | ~16 GB | 1 | 1 | 4 |
| 70B dense | ~140 GB | 1 | 1 | 4 |
| 685B MoE | ~340 GB | 2 | 1 | 4 |
| 1T MoE | ~595 GB | 4 | 1 | 4 |

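The sizing formula above can be sketched as a small Python helper (constants come from the node spec; note it counts weights only, leaving no headroom for KV cache or activations):

```python
import math

HBM_PER_GPU_GB = 192  # GB200 HBM3e per GPU
GPUS_PER_NODE = 4     # OCI-HSG node

def plan_gpus(total_params_billions: float) -> dict:
    """Apply the BF16 sizing formula to a parameter count (in billions)."""
    weight_gb = total_params_billions * 2  # 2 bytes per BF16 parameter
    gpus = math.ceil(weight_gb / HBM_PER_GPU_GB)
    return {
        "weight_gb": weight_gb,
        "gpus": gpus,
        "nodes": math.ceil(gpus / GPUS_PER_NODE),
        "tp": min(gpus, GPUS_PER_NODE),
    }
```

For example, `plan_gpus(70)` reports that a 70B dense model's ~140 GB of BF16 weights fit on a single GPU; the table nevertheless lists tp 4 for such models, running them across the full node.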
## Step 3 — Choose the hidden state dump backend

| Backend | Script | When to use |
|---------|--------|-------------|
| vLLM | `common/eagle3/dump_offline_data_vllm.sh` | Default; broad coverage via vLLM + speculators |
| HF | `common/eagle3/dump_offline_data_hf.sh` | VLMs, custom-code models, SWA attention |
| TRT-LLM | `common/eagle3/dump_offline_data.sh` | Pure-text models with TRT-LLM support (needs `--tp`/`--moe-ep`) |

Use **HF** when the model is a VLM or uses sliding window attention (TRT-LLM does not support these).
Use **vLLM** for everything else as the default.

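The selection rules condense into a tiny helper; a sketch in which the boolean flags are assumptions you fill in from the Step 1 lookup:

```python
def choose_dump_backend(is_vlm: bool = False, uses_swa: bool = False,
                        prefer_trtllm: bool = False) -> str:
    """Pick the task_1 dump script per the backend table."""
    if is_vlm or uses_swa:
        # TRT-LLM does not support VLMs or sliding window attention
        return "common/eagle3/dump_offline_data_hf.sh"
    if prefer_trtllm:
        # pure-text model with TRT-LLM support (remember --tp / --moe-ep)
        return "common/eagle3/dump_offline_data.sh"
    return "common/eagle3/dump_offline_data_vllm.sh"  # default
```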
## Step 4 — Write the YAML

Create `tools/launcher/examples/<Org>/<Model>/hf_offline_eagle3.yaml`.
Use an existing config as a reference (e.g., `tools/launcher/examples/Qwen/Qwen3.5-35B-A3B/hf_offline_eagle3.yaml`).

### Header comment

```yaml
# EAGLE3 offline speculative decoding pipeline for <org>/<model>.
#
# <Model> is a <size> <dense|MoE> model. <brief notes: attention type, special reqs>
# BF16 weights ~<size> GB — fits on <N> GB200 node(s) (<N> × 192 GB).
#
# <Special requirements (if any)>
#
# 4-step pipeline:
#   task_0: Data synthesis — query vLLM server to generate prompt samples
#   task_1: Dump hidden states — run target model to capture hidden states
#   task_2: Offline training — train the EAGLE3 draft head
#   task_3: Benchmark — evaluate speculative decoding speedup via vLLM
#
# Usage:
#   uv run launch.py --yaml examples/<Org>/<Model>/hf_offline_eagle3.yaml --yes
#   uv run slurm.py --yaml modules/Model-Optimizer/tools/launcher/examples/<Org>/<Model>/hf_offline_eagle3.yaml --yes

job_name: <Model>_EAGLE3_offline
pipeline:
  allow_to_fail: false
  skip: false
  note:

global_vars:
  hf_model: /hf-local/<org>/<model>
```

### task_0 — Data synthesis (`common/vllm/query.sh`)

Args before `--` go to the vLLM server; args after `--` go to `query.py`.

```yaml
task_0:
  script: common/vllm/query.sh
  args:
    - --model <<global_vars.hf_model>>
    - --tensor-parallel-size <TP>
    - --trust-remote-code  # add only if required
    - --  # separator
    - --data /hf-local/modelopt/Speculative-Decoding-Dataset-v2-default
    - --save /scratchspace/data
  environment:
    - HF_LOCAL: /hf-local
  slurm_config:
    _factory_: "slurm_factory"
    nodes: <nodes>
    ntasks_per_node: 1
    gpus_per_node: 4
    container: vllm/vllm-openai:latest
```

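The `--` convention can be illustrated with a short sketch (a hypothetical helper, not the launcher's actual parsing code):

```python
def split_at_separator(args: list[str]) -> tuple[list[str], list[str]]:
    """Split a task_0 args list at the first bare "--":
    everything before it goes to the vLLM server,
    everything after it goes to query.py.
    """
    if "--" in args:
        i = args.index("--")
        return args[:i], args[i + 1:]
    return args, []  # no separator: all args go to the server
```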
### task_1 — Hidden states (vLLM backend, default)

```yaml
task_1:
  script: common/eagle3/dump_offline_data_vllm.sh
  args:
    - --input-data /scratchspace/data
    - --output-dir /scratchspace/offline_hidden_states
    - --max-seq-len 8192
  environment:
    - HF_MODEL_CKPT: <<global_vars.hf_model>>
  slurm_config:
    _factory_: "slurm_factory"
    nodes: <nodes>
    ntasks_per_node: 1
    gpus_per_node: 4
    container: vllm/vllm-openai:latest
```

For **HF backend** (VLMs, SWA models), use `dump_offline_data_hf.sh` instead — same args, no TP flags needed.

For **TRT-LLM backend**, use `dump_offline_data.sh` and add `--tp <TP>` and `--moe-ep 1` (or appropriate EP).

### task_2 — Offline training (`common/eagle3/train_eagle.sh`)

```yaml
task_2:
  script: common/eagle3/train_eagle.sh
  args:
    - --config modules/Model-Optimizer/modelopt_recipes/general/speculative_decoding/eagle3.yaml
    - model.model_name_or_path=<<global_vars.hf_model>>
    - data.offline_data_path=/scratchspace/offline_hidden_states
    - training.output_dir=/scratchspace/eagle3
    - training.training_seq_len=4096
    - training.disable_tqdm=true
    - training.ar_validate_steps=500000
  slurm_config:
    _factory_: "slurm_factory"
    nodes: 1
    ntasks_per_node: 1
    gpus_per_node: 4
    container: nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc10
```

> **MoE note:** For MoE models with large per-expert hidden dims, consider increasing
> `intermediate_size` in `eagle_config.json` to match the model's `moe_intermediate_size`.

### task_3 — Benchmark (`common/specdec_bench/quick_check.sh`)

```yaml
task_3:
  script: common/specdec_bench/quick_check.sh
  args:
    - --draft_model_dir /scratchspace/export
    - --draft_length 3
    - --output_length 4096
    - --engine VLLM
    - --tp_size <TP>
    - --ep_size 1
    - --speculative_algorithm EAGLE3
    - --mtbench /hf-local/HuggingFaceH4/mt_bench_prompts/raw/question.jsonl
    - --concurrency 1
  environment:
    - HF_LOCAL: /hf-local
    - HF_MODEL_CKPT: <<global_vars.hf_model>>
  slurm_config:
    _factory_: "slurm_factory"
    nodes: <nodes>
    ntasks_per_node: 1
    gpus_per_node: 4
    container: vllm/vllm-openai:latest
```

## Step 5 — Common model-specific adjustments

| Situation | What to change |
|---|---|
| Requires `--trust-remote-code` | Add to task_0 vLLM args (before `--`) |
| VLM / multimodal | Use `dump_offline_data_hf.sh` for task_1 |
| Sliding window attention | Use `dump_offline_data_hf.sh` or `_vllm.sh` for task_1 |
| MoE with large expert hidden dim | Increase `intermediate_size` in `eagle_config.json` |
| Non-standard attention (MLA) | Verify `eagle_decoder_type` in the eagle3 recipe YAML |
| Custom tokenizer (e.g., tiktoken) | Set `TIKTOKEN_RS_CACHE_DIR` env var in task_0 and task_1 |
| NVFP4 quant model | task_0/task_3 use the quant container; task_1/task_2 use the BF16 base model — add an `hf_model_bf16` global_var |
| Model needs `trust_remote_code` at benchmark | Add `--trust-remote-code` to task_3 args |

## Step 6 — Test with dry run

Preview the resolved config before submitting:

```bash
uv run launch.py --yaml examples/<Org>/<Model>/hf_offline_eagle3.yaml --dryrun --yes -v
```

## Step 7 — Update triage chart

After adding a new model, add a row to the test matrix in
`tools/launcher/examples/EAGLE3_TRIAGE.md` with status 🔲 (not yet tested).
Fill in results after running.
Lines changed: 96 additions & 0 deletions
---
name: eagle3-review-logs
description: >
  Review EAGLE3 pipeline experiment logs from the launcher's experiments/ directory.
  Summarizes pass/fail status for all 4 tasks, diagnoses failures with root causes
  and fixes, and flags warnings. Use when the user asks to review job logs,
  check experiment results, or diagnose why a specific task failed.
user_invocable: true
---

# Review EAGLE3 Experiment Logs

Analyze output logs from an EAGLE3 pipeline run launched via `launch.py` or `slurm.py`.

## Step 0 — Find experiment logs

Locate the experiment directory. The default is `experiments/` relative to the launcher root,
or wherever `--job-dir` was pointed.

```bash
ls -td experiments/cicd/cicd_* | head -10
```

Each experiment has one subdirectory per task (0–3). Logs are `sbatch_*.out` files inside:

```bash
find experiments/<exp_id>/ -name "sbatch_*.out" | sort
```

Do this in a single Bash call. If no experiments exist, ask the user for the directory.

## Step 1 — Read all task logs

Read the last 200 lines of each log in parallel. Errors appear at the end:

```bash
for f in $(find experiments/<exp_id>/ -name "sbatch_*.out" | sort); do
  echo "=== $f ==="; tail -200 "$f"; echo
done
```

## Step 2 — Analyze

For each task log, check:

- **Exit / cancellation**: `DUE TO TIME LIMIT`, `FAILED`, signal (e.g., `signal 15`)
- **Python exceptions / tracebacks**: last exception is usually the root cause
- **CUDA errors**: OOM, NCCL timeout
- **Slurm state**: COMPLETED, FAILED, TIMEOUT, OUT_OF_MEMORY
- **Success indicators**: "Saved N samples", "Successfully processed N conversations", training loss line, AR output

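These checks can be mechanized as a first-pass classifier over each log tail; a sketch whose patterns mirror the list above (real logs may phrase things differently, so treat this as a triage aid, not a verdict):

```python
import re

# ordered: more specific states are checked first
STATUS_PATTERNS = [
    ("TIMEOUT", re.compile(r"DUE TO TIME LIMIT")),
    ("OOM", re.compile(r"CUDA out of memory|OUT_OF_MEMORY")),
    ("FAIL", re.compile(r"Traceback \(most recent call last\)|\bFAILED\b")),
]

def classify_log_tail(tail: str) -> str:
    """Coarse pass/fail status for one task's log tail."""
    for status, pattern in STATUS_PATTERNS:
        if pattern.search(tail):
            return status
    return "PASS"
```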
## Step 3 — Produce report

Output a structured markdown report:

### Summary

- Overall status: PASSED / FAILED / MIXED / PARTIAL
- Task breakdown: e.g., task_0 TIMEOUT, task_1 FAIL, task_2 skipped, task_3 skipped

### Task Results

For each task (0–3):

**Task N — \<name\>: PASS / FAIL / TIMEOUT**
- Key output: (e.g., "3277/3295 samples generated" or "Script not found")
- Error (if failed): quoted error message, max 10 lines
- Root cause: one-line diagnosis
- Suggested fix: actionable step

### Warnings

Non-fatal issues worth noting (near-OOM, tokenizer warnings, slow throughput).

## Step 4 — Suggest next steps

Based on results:

- If a task failed due to a known issue, suggest the fix and how to re-run from that task:

  ```bash
  uv run launch.py --yaml examples/<Org>/<Model>/hf_offline_eagle3.yaml \
    pipeline.task_0.skip=true \
    --yes
  ```

- If the failure pattern is new (not in `tools/launcher/examples/EAGLE3_TRIAGE.md`),
  suggest adding it to the triage chart using `/eagle3-triage` guidance.

- If all tasks passed, suggest running `/eagle3-validate` to confirm AR meets threshold.

## Known benign patterns (do NOT mark as failures)

| Pattern | Explanation |
|---|---|
| vLLM server exit code 143 | SIGTERM — server was killed after queries completed. Expected. |
| `CANCELLED AT ... DUE TO TASK FAILURE` after `exit code: 0` | Slurm cleanup of worker nodes after main task succeeded. |
| `destroy_process_group() was not called` | Benign PyTorch shutdown warning. |
| `tokenizer class ... not equal to the registered tokenizer class` | Harmless tokenizer mismatch warning. |
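These exclusions can be layered on top of any failure scan; a minimal sketch using substrings taken from the table (actual log phrasing may differ slightly):

```python
BENIGN_SUBSTRINGS = [
    "destroy_process_group() was not called",       # PyTorch shutdown warning
    "not equal to the registered tokenizer class",  # tokenizer mismatch warning
    "DUE TO TASK FAILURE",  # Slurm worker cleanup; benign only after exit code 0
]

def is_benign(line: str) -> bool:
    """True if a suspicious log line matches a known benign pattern."""
    return any(s in line for s in BENIGN_SUBSTRINGS)
```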
