Commit dfcbb32

fix: disable EAGLE3 speculative decoding for gpt-oss-120b
Streaming responses were consistently dropping the last 1-2 tokens due to a vLLM v0.12.0 EAGLE3 bug. Non-streaming was unaffected.
1 parent 2e31674 commit dfcbb32

1 file changed

Lines changed: 0 additions & 1 deletion

small-models.yaml

@@ -74,7 +74,6 @@ x-gpt-oss-common: &gpt-oss-common
  --enable-auto-tool-choice
  --max-model-len 128K
  --max-num-batched-tokens 8192
- --speculative-config '{"model":"nvidia/gpt-oss-120b-Eagle3-v2","num_speculative_tokens":3,"method":"eagle3","draft_tensor_parallel_size":1}'
  --load-format runai_streamer
  --model-loader-extra-config '{"distributed":true, "concurrency":48}'
  volumes:
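
The symptom in the commit message (streamed output missing its last 1-2 tokens while the non-streaming output is complete) can be checked by comparing the two texts for the same prompt. The following is a minimal sketch, not part of this commit. It assumes both responses were generated with deterministic (greedy) sampling so that they are comparable, and it leaves out how the chunks are collected from the server. The helper name `dropped_suffix` and the sample strings are hypothetical.

```python
from typing import Iterable


def dropped_suffix(streamed_chunks: Iterable[str], reference: str) -> str:
    """Return the tail of `reference` that is missing from the streamed text.

    `reference` is the non-streaming response text. Returns "" when the
    streamed text matches it exactly. Raises ValueError when the streamed
    text is not a prefix of the reference, since the two runs then diverged
    (for example, sampling was not greedy) and cannot be compared this way.
    """
    streamed = "".join(streamed_chunks)
    if not reference.startswith(streamed):
        raise ValueError("streamed text is not a prefix of the reference")
    return reference[len(streamed):]


if __name__ == "__main__":
    # Made-up data for illustration only, not captured server output.
    reference = "The capital of France is Paris."
    chunks = ["The capital", " of France", " is"]
    print(repr(dropped_suffix(chunks, reference)))
```

An empty result after this change, for prompts that previously returned a non-empty tail, would be consistent with the fix; it does not by itself prove the root cause.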