small-models: drop gemma-4-31B-it context-length cap by Evrard-Nil · Pull Request #50 · nearai/cvm-compose-files

Evrard-Nil · 2026-05-27T09:38:26Z

Summary

Removes --context-length 32768 from the gemma-4-31B-it SGLang command so the engine auto-detects gemma4's native max context from the model config instead of capping at 32K.

Why

The 32K cap was carried over verbatim from the old vLLM --max-model-len 32768. It's the source of a steady stream of upstream 400s we see in cloud-api logs:

HTTP error 400: This model's maximum context length is 32768 tokens.
However, you requested N output tokens and your prompt contains M input tokens...

gemma4 natively supports far more than 32K. On the current TP=2 config the engine allocates max_total_num_tokens=248222 (a ~198K main KV pool plus the gemma4 SWA pool — confirmed in the gpu11 boot logs), so single requests up to the native context length fit comfortably under the pool budget. Dropping the explicit cap lets long-context requests through instead of 400ing at 32K.
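The two budget checks described above can be sketched in a few lines. This is only an illustration: it assumes the 400 fires when prompt tokens plus requested output tokens exceed the configured context length (which is what the error text suggests), and `HYPOTHETICAL_NATIVE_CONTEXT` is a placeholder, not gemma4's confirmed value. The function names are invented for the example.

```python
OLD_CAP = 32768                 # the --context-length value this PR removes
MAX_TOTAL_NUM_TOKENS = 248222   # KV pool size reported in the gpu11 boot logs

# Placeholder only: read the real value from the model's max_position_embeddings.
HYPOTHETICAL_NATIVE_CONTEXT = 131072


def fits_context(prompt_tokens: int, output_tokens: int, context_len: int) -> bool:
    """True if prompt plus requested output fits in the context window (assumed rule)."""
    return prompt_tokens + output_tokens <= context_len


def fits_kv_pool(context_len: int, max_total_num_tokens: int) -> bool:
    """True if one max-length request fits in the KV pool."""
    return context_len <= max_total_num_tokens


# A hypothetical 40,000-token prompt asking for 2,000 output tokens.
print(fits_context(40_000, 2_000, OLD_CAP))                      # False: rejected at 32K
print(fits_context(40_000, 2_000, HYPOTHETICAL_NATIVE_CONTEXT))  # True under the placeholder
print(fits_kv_pool(HYPOTHETICAL_NATIVE_CONTEXT, MAX_TOTAL_NUM_TOKENS))
```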

Notes

SGLang will log the resolved context_len=<native> at boot; verify it picks up the model's max_position_embeddings and that max_total_num_tokens is still comfortably ≥ a single max-length request after deploy.
No other flags touched.
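The boot-log check in the note above could be scripted roughly as follows. This is a sketch, assuming both values appear in the log as `key=value` pairs (as in `context_len=<native>` and `max_total_num_tokens=248222`); the sample log lines and the `131072` figure are made up for illustration, and `check_boot` is a hypothetical helper.

```python
import re

OLD_CAP = 32768


def parse_boot_values(log_text: str) -> dict:
    """Extract context_len and max_total_num_tokens from key=value pairs in a boot log."""
    values = {}
    for key in ("context_len", "max_total_num_tokens"):
        match = re.search(rf"\b{key}=(\d+)", log_text)
        if match:
            values[key] = int(match.group(1))
    return values


def check_boot(log_text: str) -> list:
    """Return a list of problems; an empty list means both checks passed."""
    values = parse_boot_values(log_text)
    ctx = values.get("context_len")
    pool = values.get("max_total_num_tokens")
    problems = []
    if ctx is None:
        problems.append("context_len not found in log")
    elif ctx == OLD_CAP:
        problems.append("context_len is still capped at 32768")
    if pool is None:
        problems.append("max_total_num_tokens not found in log")
    elif ctx is not None and pool < ctx:
        problems.append("KV pool is smaller than one max-length request")
    return problems


# Hypothetical log excerpt; the real line layout may differ.
sample_log = "server_args: context_len=131072\nmemory pool: max_total_num_tokens=248222\n"
print(check_boot(sample_log))  # prints []
```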

Test plan

docker compose -f small-models.yaml config validates (pre-commit hook passes)
Deploy to gpu11, confirm the boot log shows the native context_len (not 32768) and "The server is fired up"
Send a >32K-token prompt to google/gemma-4-31B-it via cloud-api, confirm 200 instead of the old 400
Confirm max_total_num_tokens still ≥ native context (no KV-pool-too-small rejection)
Watch cloud-api 400 rate for the "maximum context length is 32768" message drop to ~0
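For the last step, one simple way to watch the rate is to count log lines carrying the old error text before and after the deploy. The sketch below assumes the cloud-api logs are available as plain text lines containing the upstream message; the sample lines and the `count_cap_errors` helper are hypothetical.

```python
MARKER = "maximum context length is 32768"


def count_cap_errors(lines) -> int:
    """Count log lines that contain the old 32K context-length error message."""
    return sum(1 for line in lines if MARKER in line)


# Hypothetical log excerpts from before and after the deploy.
before = [
    "HTTP error 400: This model's maximum context length is 32768 tokens.",
    "HTTP 200 google/gemma-4-31B-it",
    "HTTP error 400: This model's maximum context length is 32768 tokens.",
]
after = [
    "HTTP 200 google/gemma-4-31B-it",
    "HTTP 200 google/gemma-4-31B-it",
]

print(count_cap_errors(before))  # prints 2
print(count_cap_errors(after))   # prints 0
```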

Removes --context-length 32768 so SGLang auto-detects gemma4's native max context from the model config instead of capping at 32K. The 32K cap was carried over from the old vLLM --max-model-len and was the source of a steady stream of upstream 400s ("This model's maximum context length is 32768 tokens...") for clients sending longer prompts. gemma4's KV pool on this TP=2 config allocates max_total_num_tokens=248222 (a ~198K main pool plus the SWA pool), so single requests up to the native context fit comfortably under the pool.

Evrard-Nil · 2026-05-27T10:30:00Z

Superseded by v0.0.194 on main, which drops both --context-length 32768 and --max-running-requests 64 (this PR removed only the context-length cap). The change shipped via the backdated tag deploy flow instead.

Evrard-Nil requested a review from lloydmak99 May 27, 2026 09:38

Evrard-Nil closed this May 27, 2026

Evrard-Nil deleted the gemma4-remove-context-limit branch May 27, 2026 10:30

Evrard-Nil wants to merge 1 commit into main from gemma4-remove-context-limit
