small-models: drop gemma-4-31B-it context-length cap #50

Closed
Evrard-Nil wants to merge 1 commit into main from gemma4-remove-context-limit

@Evrard-Nil
Contributor

Summary

Removes --context-length 32768 from the gemma-4-31B-it SGLang command so the engine auto-detects gemma4's native max context from the model config instead of capping at 32K.

Why

The 32K cap was carried over verbatim from the old vLLM --max-model-len 32768. It's the source of a steady stream of upstream 400s we see in cloud-api logs:

HTTP error 400: This model's maximum context length is 32768 tokens.
However, you requested N output tokens and your prompt contains M input tokens...

gemma4 natively supports far more than 32K. On the current TP=2 config the engine allocates max_total_num_tokens=248222 (a ~198K main KV pool plus the gemma4 SWA pool — confirmed in the gpu11 boot logs), so single requests up to the native context length fit comfortably under the pool budget. Dropping the explicit cap lets long-context requests through instead of 400ing at 32K.
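
The two limits in play here can be sketched as a small check. This is a simplified model for illustration, not SGLang's actual admission logic: it assumes a request is rejected when prompt plus requested output tokens exceed the context length, as the 400 message suggests. `ASSUMED_NATIVE` is a placeholder value, not read from the model config.

```python
def check_request(prompt_tokens: int, max_new_tokens: int,
                  context_len: int, max_total_num_tokens: int) -> str:
    """Classify a single request against the context cap and the KV pool size."""
    requested = prompt_tokens + max_new_tokens
    if requested > context_len:
        # Corresponds to the "maximum context length is ..." 400.
        return "rejected: exceeds context length"
    if requested > max_total_num_tokens:
        # A single request cannot be larger than the whole KV pool.
        return "rejected: exceeds KV pool"
    return "ok"


OLD_CAP = 32768            # the explicit --context-length being removed
POOL = 248222              # max_total_num_tokens from the gpu11 boot logs
ASSUMED_NATIVE = 131072    # hypothetical placeholder for the native context

# A 40K-token prompt asking for 1K output tokens:
print(check_request(40_000, 1_000, OLD_CAP, POOL))
print(check_request(40_000, 1_000, ASSUMED_NATIVE, POOL))
```

With the old cap the request is rejected at the context check. With the placeholder native value it passes both checks, which is the behaviour this PR is after.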

Notes

  • SGLang will log the resolved context_len=<native> at boot; verify it picks up the model's max_position_embeddings and that max_total_num_tokens is still comfortably ≥ a single max-length request after deploy.
  • No other flags touched.
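
A rough way to automate the boot-log check above. The exact layout of the SGLang boot log lines is an assumption here: the sketch only relies on `context_len=<int>` and `max_total_num_tokens=<int>` appearing somewhere in the text, and the sample log string is made up.

```python
import re


def parse_boot_values(log_text: str) -> dict[str, int]:
    """Pull the first context_len and max_total_num_tokens values out of a log."""
    values: dict[str, int] = {}
    for key in ("context_len", "max_total_num_tokens"):
        match = re.search(rf"\b{key}=(\d+)", log_text)
        if match:
            values[key] = int(match.group(1))
    return values


def boot_looks_healthy(values: dict[str, int], old_cap: int = 32768) -> bool:
    """True if the cap is gone and the KV pool still covers one max-length request."""
    context_len = values.get("context_len")
    pool = values.get("max_total_num_tokens")
    if context_len is None or pool is None:
        return False  # could not find both values, so do not claim success
    return context_len != old_cap and pool >= context_len


# Hypothetical log excerpt; only the key=value pairs matter.
sample_log = "context_len=131072 ... max_total_num_tokens=248222"
values = parse_boot_values(sample_log)
print(values, boot_looks_healthy(values))
```

The helper returns `False` when either value is missing, so a changed log format shows up as a failed check instead of a silent pass.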

Test plan

  • docker compose -f small-models.yaml config validates (pre-commit hook passes)
  • Deploy to gpu11, confirm boot log shows native context_len (not 32768) and "The server is fired up"
  • Send a >32K-token prompt to google/gemma-4-31B-it via cloud-api, confirm 200 instead of the old 400
  • Confirm max_total_num_tokens still ≥ native context (no KV-pool-too-small rejection)
  • Watch cloud-api 400 rate for the "maximum context length is 32768" message drop to ~0
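
For the last item, a minimal sketch of counting the cap error in cloud-api logs. The log line format is an assumption, and the sample lines are invented. Only the error substring comes from the message quoted in this PR.

```python
from typing import Iterable

CAP_ERROR = "maximum context length is 32768"


def cap_error_rate(lines: Iterable[str]) -> tuple[int, int, float]:
    """Return (matching, total, fraction) for lines containing the cap error."""
    lines = list(lines)
    matching = sum(1 for line in lines if CAP_ERROR in line)
    total = len(lines)
    fraction = matching / total if total else 0.0
    return matching, total, fraction


# Hypothetical log lines, before and after the deploy.
before = [
    "HTTP error 400: This model's maximum context length is 32768 tokens.",
    "HTTP 200 google/gemma-4-31B-it",
]
after = [
    "HTTP 200 google/gemma-4-31B-it",
    "HTTP 200 google/gemma-4-31B-it",
]

print(cap_error_rate(before))
print(cap_error_rate(after))
```

Comparing the fraction over a fixed window before and after the deploy gives a quick read on whether the message has dropped to ~0.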

Removes --context-length 32768 so sglang auto-detects gemma4's native
max context from the model config instead of capping at 32K.

The 32K cap was carried over from the old vLLM --max-model-len and was
the source of a steady stream of upstream 400s
("This model's maximum context length is 32768 tokens...") for clients
sending longer prompts. gemma4's KV pool on this TP=2 config allocates
max_total_num_tokens=248222 (a 198K main pool + SWA pool), so single
requests up to the native context fit comfortably under the pool.

Evrard-Nil requested a review from lloydmak99 on May 27, 2026 09:38

@Evrard-Nil
Contributor Author

Superseded by v0.0.194 on main, which drops both --context-length 32768 and --max-running-requests 64 (this PR removed only the context-length cap). The change shipped via the backdated tag deploy flow instead.

Evrard-Nil closed this on May 27, 2026
Evrard-Nil deleted the gemma4-remove-context-limit branch on May 27, 2026 10:30