[OMNIML-4962] specdec_bench cell t0_d3 — Qwen/Qwen3.5-4B / DFlash / vLLM #1638
Open
ChenhanYu wants to merge 13 commits into main from pensieve-intern/OMNIML-4961/t0_d3
+118
−0
6f55a4a [OMNIML-4962] specdec_bench cell t0_d3 (ChenhanYu)
ed6e946 fix: DFlash -> DFLASH (case-sensitive algorithm name) (ChenhanYu)
c602ca8 fix: add --draft_model_dir for DFLASH (required arg) (ChenhanYu)
1280011 fix: use vllm v0.22.1 container (dflash support) (ChenhanYu)
6d43a01 fix: _is_sensitive_key guard for non-string dict keys (vllm v0.22+ se… (ChenhanYu)
5cbaa03 fix(dflash): add --block_size for DFLASH speculative tokens (ChenhanYu)
618aceb fix(dflash): --draft_length 3 → --block_size 4 for DFLASH cell t0_d3 (ChenhanYu)
6f17b10 style: ruff format --block_size argparse call (ChenhanYu)
988aa7d align cell YAML with #1564 sweep-name convention + add explanatory co… (ChenhanYu)
eaa360f move specdec_bench infra changes to #1564 (cell becomes YAML-only) (ChenhanYu)
637e062 drop redundant per-cell runtime_params for t0_d3 (ChenhanYu)
56f46c8 switch to --max_seq_len 40960 (matches the --max_seq_len CLI flag add… (ChenhanYu)
6330882 Merge branch 'main' into pensieve-intern/OMNIML-4961/t0_d3 (ChenhanYu)
118 changes: 118 additions & 0 deletions
tools/launcher/examples/Qwen/Qwen3.5-4B/specdec_bench_dflash_vllm_t0_d3.yaml
@@ -0,0 +1,118 @@
# SPEED-bench DFlash speculative-decoding run for Qwen3.5-4B via vLLM,
# matrix cell t0_d3 (temperature=0, draft_length=3 → block_size=4).
#
# Companion to specdec_bench_mtp.yaml. This variant exercises the
# z-lab/Qwen3.5-4B-DFlash external draft model. DFlash ignores
# --draft_length (which maps to vLLM's speculative_num_steps); it reads
# `speculative_num_draft_tokens` instead, which we pass via --block_size
# = draft_length + 1.
#
# Two-task pipeline:
#   task_0  Qualitative quality split (nvidia/SPEED-Bench-Internal/qualitative)
#   task_1  Long-context throughput split (nvidia/SPEED-Bench-Internal/throughput_32k)
#
# Results write to /scratchspace/qwen35_4b_dflash_vllm_t0_d3/<split>/.
# The pensieve-intern `specdec_bench` workflow's wrap_up stage owns
# publishing these to s3://team-specdec-workgroup/results/qwen35_4b_dflash_vllm_t0_d3/<split>/
# with provenance stamps (jira_ticket + huggingface_model_id).
# Sweep-name convention: <model>_<algorithm>_<engine>_<cell_tag> so
# multi-model / multi-engine / multi-cell records don't collide in S3.
#
# Container: vllm/vllm-openai:v0.22.1+ is required. The `dflash`
# speculative method landed in vLLM v0.22.0; the qwen3_5-cu130 image
# used by sibling MTP/NONE YAMLs predates this and rejects
# `--speculative_algorithm DFLASH` with "Input should be 'ngram', ...,
# 'mtp'".
#
# Slurm run on cw_dfw:
#   uv run slurm.py --yaml modules/Model-Optimizer/tools/launcher/examples/Qwen/Qwen3.5-4B/specdec_bench_dflash_vllm_t0_d3.yaml --yes

job_name: Qwen3.5-4B_specdec_bench_dflash_vllm_t0_d3

pipeline:
  global_vars:
    hf_model: /hf-local/Qwen/Qwen3.5-4B

  # Step 1: qualitative split — quality / acceptance-rate numbers with
  # DFlash block_size=4 (draft_length=3 + 1). tp_size=2 + concurrency=32
  # trades aa_timing fidelity for ~30x wall-clock speedup;
  # acceptance-length (AL) is concurrency-independent and is the primary
  # metric we care about for this split.
  #
  # No --temperature: run.py defaults sampling_kwargs to
  # {"temperature": 0} when --temperature is not supplied, which is
  # exactly what this cell (t0_*) wants. Cells with non-zero temperature
  # (t1_d3 / t1_d7) will pass `--temperature 1` on the args list.
  task_0:
    script: common/specdec_bench/run.sh
    args:
      - --dataset speed
      - --dataset_path /hf-local/nvidia/SPEED-Bench-Internal/qualitative
      - --engine VLLM
      - --speculative_algorithm DFLASH
      - --draft_model_dir /hf-local/z-lab/Qwen3.5-4B-DFlash
      - --block_size 4
      - --tp_size 2
      - --ep_size 1
      - --concurrency 32
      - --output_length 4096
      - --aa_timing
      - --show_progress
      - --save_dir /scratchspace/qwen35_4b_dflash_vllm_t0_d3/qualitative
    environment:
      - HF_MODEL_CKPT: <<global_vars.hf_model>>
      - HF_LOCAL: /hf-local
    slurm_config:
      _factory_: "slurm_factory"
      nodes: 1
      ntasks_per_node: 1
      gpus_per_node: 2
      container: vllm/vllm-openai:v0.22.1

  # Step 2: throughput_32k split — long-context throughput with DFlash
  # block_size=4. `--num_requests 80` caps the run at 80 samples (split
  # has 1,536) so it fits in the 4h Slurm time-limit; each 32K-input
  # sample takes ~60-90s. tp_size=2 doubles the KV-cache budget across
  # 2 GPUs; concurrency=8 keeps 8 * 32K = 256K tokens of in-flight KV
  # under that doubled budget.
  #
  # --max_seq_len 40960 pins the engine's sequence-length cap for the
  # 32K input + 4K output + 4K headroom; vLLM's auto-derivation from
  # gpu_memory_utilization can otherwise cap below the 32K input we
  # need. Generic CLI flag (run.py maps it to engine-specific kwarg —
  # max_model_len for vLLM here).
  task_1:
    script: common/specdec_bench/run.sh
    args:
      - --dataset speed
      - --dataset_path /hf-local/nvidia/SPEED-Bench-Internal/throughput_32k
      - --engine VLLM
      - --speculative_algorithm DFLASH
      - --draft_model_dir /hf-local/z-lab/Qwen3.5-4B-DFlash
      - --block_size 4
      - --max_seq_len 40960
      - --tp_size 2
      - --ep_size 1
      - --concurrency 8
      - --num_requests 80
      - --output_length 4096
      - --aa_timing
      - --show_progress
      - --save_dir /scratchspace/qwen35_4b_dflash_vllm_t0_d3/throughput_32k
    environment:
      - HF_MODEL_CKPT: <<global_vars.hf_model>>
      - HF_LOCAL: /hf-local
    slurm_config:
      _factory_: "slurm_factory"
      nodes: 1
      ntasks_per_node: 1
      gpus_per_node: 2
      container: vllm/vllm-openai:v0.22.1

# S3 upload is intentionally not a task in this YAML — the bench
# pipeline only writes results to /scratchspace/qwen35_4b_dflash_vllm_t0_d3/<split>/.
# The pensieve-intern specdec_bench workflow's wrap_up stage owns
# harvesting these from lustre and publishing them to the team S3 vault
# with provenance stamps (jira_ticket + huggingface_model_id) for the
# "official record" tracking.
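The numbers hard-coded in this YAML all follow from the cell tag and the comments in the file, so they can be cross-checked with a short Python sketch. The helper names below (`parse_cell_tag`, `dflash_block_size`, `sweep_name`) are hypothetical and exist only for illustration; they are not part of `run.py` or the launcher, and the `t<temperature>_d<draft_length>` tag format is an assumption inferred from the header comment.

```python
import re


def parse_cell_tag(tag: str) -> dict:
    """Split a tag such as 't0_d3' into its parts (assumed format: t<temp>_d<draft_length>)."""
    match = re.fullmatch(r"t(\d+)_d(\d+)", tag)
    if match is None:
        raise ValueError(f"unrecognized cell tag: {tag!r}")
    return {"temperature": int(match.group(1)), "draft_length": int(match.group(2))}


def dflash_block_size(draft_length: int) -> int:
    """The YAML header states block_size = draft_length + 1 for DFlash."""
    return draft_length + 1


def sweep_name(model: str, algorithm: str, engine: str, cell_tag: str) -> str:
    """Sweep-name convention from the YAML: <model>_<algorithm>_<engine>_<cell_tag>."""
    return f"{model}_{algorithm}_{engine}_{cell_tag}"


cell = parse_cell_tag("t0_d3")
block_size = dflash_block_size(cell["draft_length"])
name = sweep_name("qwen35_4b", "dflash", "vllm", "t0_d3")
save_dir = f"/scratchspace/{name}/throughput_32k"

# Sequence-length budget for task_1: 32K input + 4K output + 4K headroom.
K = 1024
max_seq_len = 32 * K + 4 * K + 4 * K

# In-flight KV tokens at concurrency=8 with 32K-token inputs.
inflight_kv_tokens = 8 * 32 * K

print(block_size, name, save_dir, max_seq_len, inflight_kv_tokens)
```

Under these assumptions the sketch reproduces the values used in the file: `--block_size 4`, the `qwen35_4b_dflash_vllm_t0_d3` directory name, and `--max_seq_len 40960`.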
[IMPORTANT Compatibility] Hard runtime dependency on unmerged PR #1564.

Both tasks in this YAML invoke `--block_size 4`, and `task_1` additionally invokes `--max_seq_len 40960`. Neither flag exists in `examples/specdec_bench/run.py` on `origin/main`; they are introduced by PR #1564 ([OMNIML-4788] specdec_bench/Qwen3.5-4B: throughput_32k benchmark + S3 upload step), which is currently `OPEN`. I verified with `git show origin/main:examples/specdec_bench/run.py`: only `--draft_length` (which DFLASH ignores per the header docstring) is available.

If this PR merges before #1564, anyone running this YAML through `slurm.py` / `launch.py` hits an `argparse` `unrecognized arguments` error in `run.sh` → `run.py "${@}"`, before any benchmark work starts. Both tasks fail; the failure is loud (immediate non-zero exit) so there's no silent corruption risk, but the cell is unrunnable.

Why it matters: the author's own commit `eaa360f7` explicitly notes "this cell PR needs #1564 to land first," so the dependency is known, but nothing in the PR title, body, or the file itself surfaces it to a maintainer who clicks merge without reading the commit log.

Suggested fix (any one):

`# Requires the --block_size / --max_seq_len CLI flags from PR #1564 (OMNIML-4788) to be on main.`

This way the dependency travels with the file.
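The failure mode described in this comment can be reproduced in isolation with Python's standard `argparse`. The parser below is a hypothetical stand-in, not the real `run.py`: it is assumed to define only `--draft_length`, mirroring what the review says is available on `origin/main`, and the default value is invented for the sketch.

```python
import argparse
import contextlib
import io


def build_parser_without_new_flags() -> argparse.ArgumentParser:
    """Hypothetical stand-in for run.py before PR #1564: only --draft_length exists."""
    parser = argparse.ArgumentParser(prog="run.py")
    parser.add_argument("--draft_length", type=int, default=3)  # default is illustrative
    return parser


def try_parse(argv: list[str]) -> tuple[int, str]:
    """Parse argv and return (exit_code, stderr_text) instead of terminating the process."""
    parser = build_parser_without_new_flags()
    stderr = io.StringIO()
    try:
        with contextlib.redirect_stderr(stderr):
            parser.parse_args(argv)
    except SystemExit as exc:
        # argparse reports unknown flags via parser.error(), which exits with status 2.
        return int(exc.code), stderr.getvalue()
    return 0, stderr.getvalue()


# The two flags this YAML passes that the stand-in parser does not know about.
exit_code, message = try_parse(["--block_size", "4", "--max_seq_len", "40960"])
print(exit_code)
print(message.strip().splitlines()[-1])
```

The non-zero exit happens during argument parsing, which matches the reviewer's point that the job fails immediately rather than producing partial or corrupted results.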