---
name: eagle3-new-model
description: >
  Add a new model to the EAGLE3 offline pipeline. Generates an hf_offline_eagle3.yaml
  launcher config for a new model checkpoint, choosing the right hidden state dump
  backend (TRT-LLM / HF / vLLM) and GPU configuration.
  Use when the user wants to run EAGLE3 on a model that does not yet have a YAML in
  tools/launcher/examples/ or asks how to configure the pipeline for a new checkpoint.
---

# EAGLE3 New Model Configuration

This skill guides you through creating `tools/launcher/examples/<Org>/<Model>/hf_offline_eagle3.yaml`
for a new model.

## Step 1 — Look up the model architecture

Determine these values from the HuggingFace model card, `config.json`, and vLLM docs:

| Property | Where to find it |
|---|---|
| Total / active parameters | Model card |
| Dense or MoE? | `config.json` → `num_experts`, `num_experts_per_tok` |
| Attention type (MHA / GQA / MLA / SWA) | Model card |
| Multimodal? (vision encoder) | Model card |
| BF16 weight size (GB) | `total_params × 2 bytes` |
| Special serving flags | vLLM docs, model README (`--trust-remote-code`, parsers) |

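Several of these properties can be read mechanically from a downloaded `config.json`. A best-effort sketch (field names vary by architecture; `num_experts`, `num_local_experts`, and `n_routed_experts` all appear in the wild, so treat the key names below as assumptions to verify against your model):

```python
import json

def summarize_config(path: str) -> dict:
    """Pull Step 1 properties out of a downloaded config.json.

    Key names are architecture-dependent; this is a best-effort
    detector, not an exhaustive one. Always cross-check the model card.
    """
    with open(path) as f:
        cfg = json.load(f)

    # MoE detection: different families use different expert-count keys.
    moe_keys = ("num_experts", "num_local_experts", "n_routed_experts")
    num_experts = next((cfg[k] for k in moe_keys if k in cfg), None)

    # MHA vs GQA from head counts; MLA/SWA need the model card.
    n_heads = cfg.get("num_attention_heads")
    n_kv = cfg.get("num_key_value_heads")
    if n_heads and n_kv:
        attention = "MHA" if n_heads == n_kv else "GQA"
    else:
        attention = "unknown (check model card)"

    return {
        "is_moe": num_experts is not None,
        "num_experts": num_experts,
        "experts_per_tok": cfg.get("num_experts_per_tok"),
        "attention": attention,
        "sliding_window": cfg.get("sliding_window"),  # non-null suggests SWA
        "multimodal": "vision_config" in cfg,
    }
```
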
## Step 2 — Calculate GPU requirements (OCI-HSG / GB200)

OCI-HSG nodes: **4 GPUs × 192 GB HBM3e = 768 GB per node**

```
BF16 weight size = total_params × 2 bytes
GPUs needed = ceil(weight_size_GB / 192)
nodes = ceil(gpus_needed / 4)
tp = nodes × 4   # whole nodes are allocated (gpus_per_node: 4)
```

| Model | Weights | GPUs | nodes | tp |
|---|---|---|---|---|
| 8B dense | ~16 GB (BF16) | 1 | 1 | 4 |
| 70B dense | ~140 GB (BF16) | 1 | 1 | 4 |
| 685B MoE | ~340 GB (quantized) | 2 | 1 | 4 |
| 1T MoE | ~595 GB (quantized) | 4 | 1 | 4 |

Note: the MoE rows assume quantized (≈4-bit) checkpoints; at BF16, a 685B model is ~1.4 TB and needs multiple nodes.

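The sizing rules above can be sketched in Python (`tp` here follows the whole-node allocation used in the example table):

```python
import math

GPU_HBM_GB = 192     # HBM per GPU on an OCI-HSG GB200 node
GPUS_PER_NODE = 4

def gpu_requirements(weight_size_gb: float) -> dict:
    """Apply the Step 2 sizing formulas to a checkpoint's weight size."""
    gpus_needed = math.ceil(weight_size_gb / GPU_HBM_GB)
    nodes = math.ceil(gpus_needed / GPUS_PER_NODE)
    tp = nodes * GPUS_PER_NODE  # whole nodes are allocated (gpus_per_node: 4)
    return {"gpus_needed": gpus_needed, "nodes": nodes, "tp": tp}

# 70B dense at BF16: 70e9 params × 2 bytes ≈ 140 GB
print(gpu_requirements(140))  # {'gpus_needed': 1, 'nodes': 1, 'tp': 4}
```
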
## Step 3 — Choose the hidden state dump backend

| Backend | Script | When to use |
|---------|--------|-------------|
| vLLM | `common/eagle3/dump_offline_data_vllm.sh` | Default; broad coverage via vLLM + speculators |
| HF | `common/eagle3/dump_offline_data_hf.sh` | VLMs, custom-code models, SWA attention |
| TRT-LLM | `common/eagle3/dump_offline_data.sh` | Pure-text models with TRT-LLM support (needs `--tp`/`--moe-ep`) |

Use **HF** when the model is a VLM or uses sliding window attention (TRT-LLM supports neither).
Use **vLLM** as the default for everything else.

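The decision rule can be captured as a small helper (a sketch of the table above; `trtllm_supported` is your own judgement call per the vLLM/TRT-LLM docs, not something detected automatically):

```python
def choose_dump_backend(is_vlm: bool, uses_swa: bool,
                        trtllm_supported: bool = False) -> str:
    """Pick the hidden state dump script per the Step 3 table."""
    if is_vlm or uses_swa:
        # TRT-LLM supports neither VLMs nor sliding window attention.
        return "common/eagle3/dump_offline_data_hf.sh"
    if trtllm_supported:
        # Pure-text model with confirmed TRT-LLM support.
        return "common/eagle3/dump_offline_data.sh"
    # Default: vLLM backend.
    return "common/eagle3/dump_offline_data_vllm.sh"
```
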
## Step 4 — Write the YAML

Create `tools/launcher/examples/<Org>/<Model>/hf_offline_eagle3.yaml`.
Use an existing config as a reference (e.g., `tools/launcher/examples/Qwen/Qwen3.5-35B-A3B/hf_offline_eagle3.yaml`).

### Header comment and top-level keys

```yaml
# EAGLE3 offline speculative decoding pipeline for <org>/<model>.
#
# <Model> is a <size> <dense|MoE> model. <brief notes: attention type, special reqs>
# BF16 weights ~<size> GB — fits on <N> GB200 node(s) (<N> × 192 GB).
#
# <Special requirements (if any)>
#
# 4-step pipeline:
#   task_0: Data synthesis — query vLLM server to generate prompt samples
#   task_1: Dump hidden states — run target model to capture hidden states
#   task_2: Offline training — train the EAGLE3 draft head
#   task_3: Benchmark — evaluate speculative decoding speedup via vLLM
#
# Usage:
#   uv run launch.py --yaml examples/<Org>/<Model>/hf_offline_eagle3.yaml --yes
#   uv run slurm.py --yaml modules/Model-Optimizer/tools/launcher/examples/<Org>/<Model>/hf_offline_eagle3.yaml --yes

job_name: <Model>_EAGLE3_offline
pipeline:
  allow_to_fail: false
  skip: false
  note:

  global_vars:
    hf_model: /hf-local/<org>/<model>
```
|
### task_0 — Data synthesis (`common/vllm/query.sh`)

Args before `--` go to the vLLM server; args after `--` go to `query.py`.

```yaml
  task_0:
    script: common/vllm/query.sh
    args:
      - --model <<global_vars.hf_model>>
      - --tensor-parallel-size <TP>
      - --trust-remote-code   # add only if required
      - --                    # separator
      - --data /hf-local/modelopt/Speculative-Decoding-Dataset-v2-default
      - --save /scratchspace/data
    environment:
      - HF_LOCAL: /hf-local
    slurm_config:
      _factory_: "slurm_factory"
      nodes: <nodes>
      ntasks_per_node: 1
      gpus_per_node: 4
      container: vllm/vllm-openai:latest
```

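To sanity-check an args list before launching, the `--` split can be mimicked locally (a sketch of the convention described above, not the launcher's actual parser):

```python
def split_args(args: list[str]) -> tuple[list[str], list[str]]:
    """Split a task_0 args list at the first `--`.

    Returns (vllm_server_args, query_py_args); if no separator is
    present, everything is treated as server args.
    """
    if "--" in args:
        i = args.index("--")
        return args[:i], args[i + 1:]
    return args, []
```
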
### task_1 — Hidden states (vLLM backend, default)

```yaml
  task_1:
    script: common/eagle3/dump_offline_data_vllm.sh
    args:
      - --input-data /scratchspace/data
      - --output-dir /scratchspace/offline_hidden_states
      - --max-seq-len 8192
    environment:
      - HF_MODEL_CKPT: <<global_vars.hf_model>>
    slurm_config:
      _factory_: "slurm_factory"
      nodes: <nodes>
      ntasks_per_node: 1
      gpus_per_node: 4
      container: vllm/vllm-openai:latest
```

For the **HF backend** (VLMs, SWA models), use `dump_offline_data_hf.sh` instead — same args, no TP flags needed.

For the **TRT-LLM backend**, use `dump_offline_data.sh` and add `--tp <TP>` and `--moe-ep 1` (or an appropriate EP).

### task_2 — Offline training (`common/eagle3/train_eagle.sh`)

```yaml
  task_2:
    script: common/eagle3/train_eagle.sh
    args:
      - --config modules/Model-Optimizer/modelopt_recipes/general/speculative_decoding/eagle3.yaml
      - model.model_name_or_path=<<global_vars.hf_model>>
      - data.offline_data_path=/scratchspace/offline_hidden_states
      - training.output_dir=/scratchspace/eagle3
      - training.training_seq_len=4096
      - training.disable_tqdm=true
      - training.ar_validate_steps=500000
    slurm_config:
      _factory_: "slurm_factory"
      nodes: 1
      ntasks_per_node: 1
      gpus_per_node: 4
      container: nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc10
```

> **MoE note:** For MoE models with large per-expert hidden dims, consider increasing
> `intermediate_size` in `eagle_config.json` to match the model's `moe_intermediate_size`.

### task_3 — Benchmark (`common/specdec_bench/quick_check.sh`)

```yaml
  task_3:
    script: common/specdec_bench/quick_check.sh
    args:
      - --draft_model_dir /scratchspace/export
      - --draft_length 3
      - --output_length 4096
      - --engine VLLM
      - --tp_size <TP>
      - --ep_size 1
      - --speculative_algorithm EAGLE3
      - --mtbench /hf-local/HuggingFaceH4/mt_bench_prompts/raw/question.jsonl
      - --concurrency 1
    environment:
      - HF_LOCAL: /hf-local
      - HF_MODEL_CKPT: <<global_vars.hf_model>>
    slurm_config:
      _factory_: "slurm_factory"
      nodes: <nodes>
      ntasks_per_node: 1
      gpus_per_node: 4
      container: vllm/vllm-openai:latest
```

## Step 5 — Common model-specific adjustments

| Situation | What to change |
|---|---|
| Requires `--trust-remote-code` | Add to task_0 vLLM args (before `--`) |
| VLM / multimodal | Use `dump_offline_data_hf.sh` for task_1 |
| Sliding window attention | Use `dump_offline_data_hf.sh` or `_vllm.sh` for task_1 |
| MoE with large expert hidden dim | Increase `intermediate_size` in `eagle_config.json` |
| Non-standard attention (MLA) | Verify `eagle_decoder_type` in the eagle3 recipe YAML |
| Custom tokenizer (e.g., tiktoken) | Set the `TIKTOKEN_RS_CACHE_DIR` env var in task_0 and task_1 |
| NVFP4 quant model | task_0/task_3 use the quant checkpoint; task_1/task_2 use the BF16 base model — add an `hf_model_bf16` global_var |
| Model needs `trust_remote_code` at benchmark | Add `--trust-remote-code` to task_3 args |

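For the NVFP4 row, the two-checkpoint split might look like this in `global_vars` (paths are illustrative; adapt them to your checkpoint layout):

```yaml
  global_vars:
    hf_model: /hf-local/<org>/<model>-NVFP4   # served in task_0 / task_3
    hf_model_bf16: /hf-local/<org>/<model>    # hidden state dump and training in task_1 / task_2
```
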
## Step 6 — Test with a dry run

Preview the resolved config before submitting:

```bash
uv run launch.py --yaml examples/<Org>/<Model>/hf_offline_eagle3.yaml --dryrun --yes -v
```

## Step 7 — Update triage chart

After adding a new model, add a row to the test matrix in
`tools/launcher/examples/EAGLE3_TRIAGE.md` with status 🔲 (not yet tested).
Fill in results after running.