[OMNIML-4869] author_yaml (#1574)

ChenhanYu · web-flow · commit 54fb87e972f4 · 2026-05-31T11:53:00.000+05:30
Draft PR opened by **pensieve-intern** for [OMNIML-4869](https://jirasw.nvidia.com/browse/OMNIML-4869). Stage `author_yaml` of Epic `OMNIML-4868`. The agent ran from the SPEC on the ticket description; review every change before marking ready. _Always-draft is enforced; the bot never auto-merges._

---

**Agent's self-narration** (stripped from PR diff; surfaced here for context):

`VERIFICATION_COMMENT.txt`:

```
OMNIML-4869 status: model not staged on cw_dfw.

Spec checks:
- HF Hub model exists: Qwen/Qwen3.5-4B (public, not gated; model_type=qwen3_5).
- Cluster stage check failed: /lustre/fsw/portfolios/coreai/projects/coreai_dlalgo_modelopt/hf-local/Qwen/Qwen3.5-4B not found.

Requested action on Epic: please stage the model via `/manage-assets cw_dfw get Qwen/Qwen3.5-4B`.

Notes:
- YAML already exists at tools/launcher/examples/Qwen/Qwen3.5-4B/specdec_bench_mtp_vllm.yaml with tp_size=1, gpus_per_node=1, and container vllm/vllm-openai:qwen3_5-cu130.
- Once staged, I can run the required dry-run: `uv run launch.py --yaml tools/launcher/examples/Qwen/Qwen3.5-4B/specdec_bench_mtp_vllm.yaml --dryrun --yes -v`.
```

_Pollution-strip removed `VERIFICATION_COMMENT.txt` from this commit (sidecar narration and/or incidental lockfile regeneration are never part of the agent's intended deliverable)._

## Summary by CodeRabbit

* **Chores**
  * Added a benchmark configuration to evaluate speculative decoding (MTP) performance for the Qwen3.5-4B model, enabling separate speed and high-throughput (32k) runs with adjustable decoding and runtime parameters.
  * Configured execution settings for single-node GPU benchmarking and standardized container/runtime invocation for reproducible performance tests.

Signed-off-by: Chenhan Yu <chenhany@nvidia.com>
diff --git a/tools/launcher/examples/Qwen/Qwen3.5-4B/specdec_bench_mtp_vllm.yaml b/tools/launcher/examples/Qwen/Qwen3.5-4B/specdec_bench_mtp_vllm.yaml
@@ -0,0 +1,67 @@
+# SPEED-bench MTP speculative-decoding run for Qwen3.5-4B via vLLM.
+#
+# The qwen3_5 model_type needs transformers >= 4.58, which is NOT in
+# vllm/vllm-openai:latest yet — use the qwen3_5-cu130 tag instead.
+#
+# Slurm run on cw_dfw:
+#   uv run slurm.py --yaml modules/Model-Optimizer/tools/launcher/examples/Qwen/Qwen3.5-4B/specdec_bench_mtp_vllm.yaml --yes
+
+job_name: Qwen3.5-4B_specdec_bench_mtp_vllm
+
+pipeline:
+  global_vars:
+    hf_model: /hf-local/Qwen/Qwen3.5-4B
+
+  # task_0: SPEED qualitative split
+  task_0:
+    script: common/specdec_bench/run.sh
+    args:
+      - --dataset speed
+      - --dataset_path /hf-local/nvidia/SPEED-Bench-Internal/qualitative
+      - --engine VLLM
+      - --speculative_algorithm MTP
+      - --draft_length 3
+      - --tp_size 1
+      - --ep_size 1
+      - --concurrency 32
+      - --output_length 4096
+      - --aa_timing
+      - --show_progress
+      - --save_dir /scratchspace/{sweep_name_default}/qualitative
+    environment:
+      - HF_MODEL_CKPT: <<global_vars.hf_model>>
+      - HF_LOCAL: /hf-local
+    slurm_config:
+      _factory_: "slurm_factory"
+      nodes: 1
+      ntasks_per_node: 1
+      gpus_per_node: 1
+      container: vllm/vllm-openai:qwen3_5-cu130
+
+  # task_1: SPEED throughput_32k split
+  task_1:
+    script: common/specdec_bench/run.sh
+    args:
+      - --dataset speed
+      - --dataset_path /hf-local/nvidia/SPEED-Bench-Internal/throughput_32k
+      - --engine VLLM
+      - --speculative_algorithm MTP
+      - --draft_length 3
+      - --tp_size 1
+      - --ep_size 1
+      - --concurrency 8
+      - --num_requests 80
+      - --runtime_params common/specdec_bench/runtime_params_throughput_32k.yaml
+      - --output_length 4096
+      - --aa_timing
+      - --show_progress
+      - --save_dir /scratchspace/{sweep_name_default}/throughput_32k
+    environment:
+      - HF_MODEL_CKPT: <<global_vars.hf_model>>
+      - HF_LOCAL: /hf-local
+    slurm_config:
+      _factory_: "slurm_factory"
+      nodes: 1
+      ntasks_per_node: 1
+      gpus_per_node: 1
+      container: vllm/vllm-openai:qwen3_5-cu130