Commit 4300a1d

OMNIML-4672 — training_support: step3_train.yaml for Qwen3-8B
Agent-authored via pensieve-intern's training_support stage on Epic OMNIML-4666. Faithful extraction of task_2 (EAGLE3 draft-head training) from hf_offline_eagle3.yaml's monolithic pipeline, renamed task_0 for the standalone step convention.

Signed-off-by: Chenhan D. Yu <chenhany@nvidia.com>
1 parent 62401e1 commit 4300a1d

1 file changed

Lines changed: 26 additions & 0 deletions

@@ -0,0 +1,26 @@
+job_name: Qwen3-8B_EAGLE3_train
+pipeline:
+  allow_to_fail: false
+  skip: false
+  note:
+
+  global_vars:
+    hf_model: /hf-local/Qwen/Qwen3-8B
+
+  # Step 3: Train EAGLE3 draft head (offline, single task)
+  task_0:
+    script: common/eagle3/train_eagle.sh
+    args:
+      - --config modules/Model-Optimizer/modelopt_recipes/general/speculative_decoding/eagle3.yaml
+      - model.model_name_or_path=<<global_vars.hf_model>>
+      - data.offline_data_path=/scratchspace/offline_hidden_states
+      - training.output_dir=/scratchspace/eagle3
+      - training.training_seq_len=4096
+      - training.disable_tqdm=true
+      - training.ar_validate_steps=500000
+    slurm_config:
+      _factory_: "slurm_factory"
+      nodes: 1
+      ntasks_per_node: 1
+      gpus_per_node: 8
+      container: nvcr.io/nvidia/tensorrt-llm/release:1.2.0
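
Reviewer note: the `<<global_vars.hf_model>>` placeholder in `args` is presumably expanded from the `global_vars` block before `train_eagle.sh` runs. The launcher's actual substitution code is not part of this commit, so the following is only a minimal sketch of how such an expansion could work. The `resolve` helper, the `PLACEHOLDER` pattern, and the shape of the `context` dict are assumptions made for illustration, not the launcher's API.

```python
import re

# Hypothetical placeholder syntax: <<dotted.path>>, as seen in the args above.
# This is NOT the launcher's real resolver, only an illustration of the idea.
PLACEHOLDER = re.compile(r"<<([A-Za-z0-9_.]+)>>")


def resolve(value: str, context: dict) -> str:
    """Replace each <<dotted.path>> in value with the matching entry from context."""

    def lookup(match: re.Match) -> str:
        node = context
        for key in match.group(1).split("."):
            node = node[key]  # raises KeyError if the path does not exist
        return str(node)

    return PLACEHOLDER.sub(lookup, value)


# Values copied from the YAML in this commit.
context = {"global_vars": {"hf_model": "/hf-local/Qwen/Qwen3-8B"}}
args = [
    "model.model_name_or_path=<<global_vars.hf_model>>",
    "training.output_dir=/scratchspace/eagle3",
]

resolved = [resolve(arg, context) for arg in args]
print(resolved)
```

Under this sketch, an unknown path such as `<<global_vars.missing>>` fails with a `KeyError` instead of being passed through to the training script unexpanded, which seems the safer behaviour for a batch job.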
