Commit 54fb87e

[OMNIML-4869] author_yaml (#1574)
Draft PR opened by **pensieve-intern** for [OMNIML-4869](https://jirasw.nvidia.com/browse/OMNIML-4869). Stage `author_yaml` of Epic `OMNIML-4868`. The agent ran from the SPEC on the ticket description; review every change before marking ready. _Always-draft is enforced: the bot never auto-merges._

---

**Agent's self-narration** (stripped from PR diff; surfaced here for context):

`VERIFICATION_COMMENT.txt`:

```
OMNIML-4869 status: model not staged on cw_dfw.

Spec checks:
- HF Hub model exists: Qwen/Qwen3.5-4B (public, not gated; model_type=qwen3_5).
- Cluster stage check failed: /lustre/fsw/portfolios/coreai/projects/coreai_dlalgo_modelopt/hf-local/Qwen/Qwen3.5-4B not found.

Requested action on Epic: please stage the model via `/manage-assets cw_dfw get Qwen/Qwen3.5-4B`.

Notes:
- YAML already exists at tools/launcher/examples/Qwen/Qwen3.5-4B/specdec_bench_mtp_vllm.yaml with tp_size=1, gpus_per_node=1, and container vllm/vllm-openai:qwen3_5-cu130.
- Once staged, I can run the required dry-run: `uv run launch.py --yaml tools/launcher/examples/Qwen/Qwen3.5-4B/specdec_bench_mtp_vllm.yaml --dryrun --yes -v`.
```

_Pollution-strip removed `VERIFICATION_COMMENT.txt` from this commit (sidecar narration and/or incidental lockfile regeneration are never part of the agent's intended deliverable)._

<!-- This is an auto-generated comment: release notes by coderabbit.ai -->

## Summary by CodeRabbit

* **Chores**
  * Added a benchmark configuration to evaluate speculative decoding (MTP) performance for the Qwen3.5-4B model, enabling separate speed and high-throughput (32k) runs with adjustable decoding and runtime parameters.
  * Configured execution settings for single-node GPU benchmarking and standardized container/runtime invocation for reproducible performance tests.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Signed-off-by: Chenhan Yu <chenhany@nvidia.com>
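The "cluster stage check" in the agent's narration amounts to testing whether the model directory exists under the shared `hf-local` root. A minimal sketch of such a check follows; the function names `staged_model_path` and `is_staged` are hypothetical and are not taken from the launcher or from `/manage-assets`, whose real implementation is not shown in this commit.

```python
from pathlib import Path


def staged_model_path(hf_local_root: str, model_id: str) -> Path:
    """Join the hf-local root and a Hub-style model id such as 'Qwen/Qwen3.5-4B'."""
    return Path(hf_local_root) / model_id


def is_staged(hf_local_root: str, model_id: str) -> bool:
    """Report whether the model directory is present (illustrative check only)."""
    return staged_model_path(hf_local_root, model_id).is_dir()
```

Called with the root quoted in the verification comment and the id `Qwen/Qwen3.5-4B`, `is_staged` would return `False` on any machine where that directory is absent, which matches the "not found" status the agent reported.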
1 parent 7ae4ee7 commit 54fb87e

1 file changed

Lines changed: 67 additions & 0 deletions

@@ -0,0 +1,67 @@
+# SPEED-bench MTP speculative-decoding run for Qwen3.5-4B via vLLM.
+#
+# The qwen3_5 model_type needs transformers >= 4.58, which is NOT in
+# vllm/vllm-openai:latest yet — use the qwen3_5-cu130 tag instead.
+#
+# Slurm run on cw_dfw:
+#   uv run slurm.py --yaml modules/Model-Optimizer/tools/launcher/examples/Qwen/Qwen3.5-4B/specdec_bench_mtp_vllm.yaml --yes
+
+job_name: Qwen3.5-4B_specdec_bench_mtp_vllm
+
+pipeline:
+  global_vars:
+    hf_model: /hf-local/Qwen/Qwen3.5-4B
+
+  # task_0: SPEED qualitative split
+  task_0:
+    script: common/specdec_bench/run.sh
+    args:
+      - --dataset speed
+      - --dataset_path /hf-local/nvidia/SPEED-Bench-Internal/qualitative
+      - --engine VLLM
+      - --speculative_algorithm MTP
+      - --draft_length 3
+      - --tp_size 1
+      - --ep_size 1
+      - --concurrency 32
+      - --output_length 4096
+      - --aa_timing
+      - --show_progress
+      - --save_dir /scratchspace/{sweep_name_default}/qualitative
+    environment:
+      - HF_MODEL_CKPT: <<global_vars.hf_model>>
+      - HF_LOCAL: /hf-local
+    slurm_config:
+      _factory_: "slurm_factory"
+      nodes: 1
+      ntasks_per_node: 1
+      gpus_per_node: 1
+      container: vllm/vllm-openai:qwen3_5-cu130
+
+  # task_1: SPEED throughput_32k split
+  task_1:
+    script: common/specdec_bench/run.sh
+    args:
+      - --dataset speed
+      - --dataset_path /hf-local/nvidia/SPEED-Bench-Internal/throughput_32k
+      - --engine VLLM
+      - --speculative_algorithm MTP
+      - --draft_length 3
+      - --tp_size 1
+      - --ep_size 1
+      - --concurrency 8
+      - --num_requests 80
+      - --runtime_params common/specdec_bench/runtime_params_throughput_32k.yaml
+      - --output_length 4096
+      - --aa_timing
+      - --show_progress
+      - --save_dir /scratchspace/{sweep_name_default}/throughput_32k
+    environment:
+      - HF_MODEL_CKPT: <<global_vars.hf_model>>
+      - HF_LOCAL: /hf-local
+    slurm_config:
+      _factory_: "slurm_factory"
+      nodes: 1
+      ntasks_per_node: 1
+      gpus_per_node: 1
+      container: vllm/vllm-openai:qwen3_5-cu130
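
Both tasks pair `--tp_size 1` with `gpus_per_node: 1` on a single node, so the tensor-parallel degree fits the Slurm allocation. A reviewer could check that pairing mechanically with something like the sketch below. It assumes that `tp_size` is the number of GPUs the engine needs and that each `args` entry is a single `--flag value` string, as in the diff above. The helper names `parse_flags` and `gpus_short` are hypothetical and not part of the launcher, and the task is written as the plain dict a YAML loader would produce, trimmed to the fields the check reads.

```python
from typing import Any


def parse_flags(args: list[str]) -> dict[str, Any]:
    """Turn entries such as '--tp_size 1' or '--aa_timing' into a dict.

    Flags without a value (for example '--aa_timing') map to True.
    """
    flags: dict[str, Any] = {}
    for entry in args:
        name, _, value = entry.strip().partition(" ")
        flags[name.lstrip("-")] = value.strip() if value else True
    return flags


def gpus_short(task: dict[str, Any]) -> int:
    """Return how many GPUs the task lacks; 0 means the request fits."""
    flags = parse_flags(task["args"])
    slurm = task["slurm_config"]
    needed = int(flags.get("tp_size", 1))  # assumption: tp_size == GPUs required
    available = int(slurm["nodes"]) * int(slurm["gpus_per_node"])
    return max(0, needed - available)


# Subset of task_0 from the diff, as a YAML loader would represent it.
task_0 = {
    "args": ["--engine VLLM", "--tp_size 1", "--ep_size 1", "--aa_timing"],
    "slurm_config": {"nodes": 1, "ntasks_per_node": 1, "gpus_per_node": 1},
}

if __name__ == "__main__":
    print(gpus_short(task_0))
```

For the values in this commit the shortfall is 0. Raising `--tp_size` without also raising `gpus_per_node` would make `gpus_short` return a positive number, which is the kind of mismatch worth catching before a dry-run.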
