small-models: drop gemma-4-31B-it context-length cap #50
Closed
Evrard-Nil wants to merge 1 commit into
Removes --context-length 32768 so sglang auto-detects gemma4's native
max context from the model config instead of capping at 32K.
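A quick way to confirm the flag is really gone from a launch command is to scan its arguments. The sketch below is only illustrative: `context_length_cap` is a hypothetical helper, and the two command strings are made-up examples, since the actual command in small-models.yaml is not reproduced in this PR text.

```python
import shlex


def context_length_cap(command: str):
    """Return the value passed to --context-length, or None if the flag is absent."""
    args = shlex.split(command)
    for i, arg in enumerate(args):
        if arg == "--context-length":
            # Space-separated form: the value is the next token, if there is one.
            return args[i + 1] if i + 1 < len(args) else ""
        if arg.startswith("--context-length="):
            # Equals-separated form: the value follows the first "=".
            return arg.split("=", 1)[1]
    return None


# Illustrative command lines only (assumed, not copied from small-models.yaml).
before = (
    "python3 -m sglang.launch_server --model-path google/gemma-4-31B-it "
    "--tp 2 --context-length 32768"
)
after = "python3 -m sglang.launch_server --model-path google/gemma-4-31B-it --tp 2"

if __name__ == "__main__":
    print("before:", context_length_cap(before))
    print("after:", context_length_cap(after))
```

A check like this could run next to the compose validation step, but it only inspects the string it is given; it does not resolve YAML anchors or environment substitution.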
The 32K cap was carried over from the old vLLM --max-model-len and was
the source of a steady stream of upstream 400s
("This model's maximum context length is 32768 tokens...") for clients
sending longer prompts. gemma4's KV pool on this TP=2 config allocates
max_total_num_tokens=248222 (a 198K main pool + SWA pool), so single
requests up to the native context fit comfortably under the pool.
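To tell these cap-induced 400s apart from other client errors when scanning logs, one option is to match on the error text quoted above. This is a minimal sketch: `context_cap_from_error` is a hypothetical helper, and it assumes the response body contains the message as plain text somewhere (the exact JSON envelope is not shown in this PR, so the helper does not parse one).

```python
import re

# Matches the message quoted in this PR and captures the advertised limit.
_CAP_RE = re.compile(r"maximum context length is (\d+) tokens")


def context_cap_from_error(status: int, body: str):
    """Return the context limit named in a 400 response, or None otherwise."""
    if status != 400:
        return None
    match = _CAP_RE.search(body)
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    sample = "This model's maximum context length is 32768 tokens..."
    print(context_cap_from_error(400, sample))
```

After the cap is removed, a request that used to trip this matcher with 32768 should either succeed or report a different, larger limit.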
Superseded by v0.0.194 on main, which drops both
Summary

Removes `--context-length 32768` from the gemma-4-31B-it SGLang command so the engine auto-detects gemma4's native max context from the model config instead of capping at 32K.

Why

The 32K cap was carried over verbatim from the old vLLM `--max-model-len 32768`. It's the source of a steady stream of upstream 400s we see in cloud-api logs:

"This model's maximum context length is 32768 tokens..."

gemma4 natively supports far more than 32K. On the current TP=2 config the engine allocates `max_total_num_tokens=248222` (a ~198K main KV pool plus the gemma4 SWA pool, confirmed in the gpu11 boot logs), so single requests up to the native context length fit comfortably under the pool budget. Dropping the explicit cap lets long-context requests through instead of 400ing at 32K.

Notes

- `context_len=<native>` at boot; verify it picks up the model's `max_position_embeddings` and that `max_total_num_tokens` is still comfortably ≥ a single max-length request after deploy.

Test plan

- `docker compose -f small-models.yaml config` validates (pre-commit hook passes)
- `context_len` (not 32768) and `The server is fired up`
- `google/gemma-4-31B-it` via cloud-api, confirm 200 instead of the old 400
- `max_total_num_tokens` still ≥ native context (no KV-pool-too-small rejection)
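The boot-log checks in the test plan could be scripted roughly as follows. This is a sketch under assumptions: it supposes the boot log exposes `context_len=<int>` and `max_total_num_tokens=<int>` as key=value pairs (the form used in this PR), the helper names are hypothetical, and the sample line is invented for illustration, reusing the old 32768 cap because the native context value is not stated here.

```python
import re


def parse_boot_values(log_text: str) -> dict:
    """Extract context_len and max_total_num_tokens from boot-log text, if present."""
    values = {}
    for key in ("context_len", "max_total_num_tokens"):
        match = re.search(rf"\b{key}=(\d+)", log_text)
        if match:
            values[key] = int(match.group(1))
    return values


def pool_fits_one_request(values: dict) -> bool:
    """True only if both values were found and the KV pool covers one max-length request."""
    context_len = values.get("context_len")
    pool = values.get("max_total_num_tokens")
    if context_len is None or pool is None:
        return False
    return pool >= context_len


if __name__ == "__main__":
    # Invented sample line; real boot logs may carry more fields or different layout.
    sample = "max_total_num_tokens=248222, context_len=32768"
    values = parse_boot_values(sample)
    print(values, pool_fits_one_request(values))
```

Returning False when either value is missing is a deliberate choice here, so that an unexpected log format fails the check instead of passing silently.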