[Fix] Add thinker input length check and parameter passing by Jayon02 · Pull Request #330 · sgl-project/sglang-omni

Jayon02 · 2026-04-21T17:23:24Z

Motivation

Modifications

sglang_omni/models/qwen3_omni/config.py: support thinker_max_seq_len in server_args_overrides and propagate it into the thinker stage config.
sglang_omni/models/qwen3_omni/pipeline/engine_io.py: add an explicit prompt-length guard before thinker execution and raise a clear error when requests exceed thinker_max_seq_len.
sglang_omni/models/qwen3_omni/pipeline/stages.py: wire thinker_max_seq_len into both thinker request paths so the new length guard uses the configured limit consistently.
sglang_omni/models/qwen3_omni/components/preprocessor.py: set a default video_fps for video requests in the Qwen3-Omni preprocessor.
examples/run_qwen3_omni_server.py and examples/run_qwen3_omni_text_first.py: pass thinker_max_seq_len through the existing override path instead of patching the thinker stage config manually after construction.
sglang_omni/cli/serve.py: expose --thinker-max-seq-len on the unified server CLI and apply it to the thinker stage config during server startup.
Preserve both the override path and direct CLI parameter passing.
Add exception handling for requests that exceed thinker_max_seq_len, returning 400 Bad Request.

Related Issues

#327 phase 1: Bug fix.

Accuracy Test

Functionality Test

python -m sglang_omni.cli.cli serve   --model-path Qwen/Qwen3-Omni-30B-A3B-Instruct   --text-only   --thinker-max-seq-len 128   --port 8000

curl http://127.0.0.1:8000/v1/chat/completions   -H "Content-Type: application/json"   -d '{
    "model": "Qwen/Qwen3-Omni-30B-A3B-Instruct",
    "messages": [{
      "role": "user",
      "content": "Please repeat the following text exactly and then summarize it: a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a"
    }],
    "max_tokens": 16,
    "stream": false
  }'

Benchmark & Profiling

Checklist

Format your code according with pre-commit.
Add unit tests.
Update documentation / docstrings / example tutorials as needed.
Provide throughput / latency benchmark results and accuracy evaluation results as needed.
For reviewers: If you haven't made any contributions to this PR and are only assisting with merging the main branch, please remove yourself as a co-author when merging the PR.

Ratish1

Overall Looks good. Just a few comments below.

Ccyest · 2026-04-21T21:08:40Z

+    except RuntimeError as exc:
+        if _is_bad_request_error(exc):
+            raise HTTPException(status_code=400, detail=str(exc)) from exc
+        raise HTTPException(status_code=500, detail=str(exc)) from exc


Found while reviewing. The PR adds a prompt-length guard for the thinker stage and needs to return HTTP 400 (not 500) when a request exceeds thinker_max_seq_len.

The current implementation applies a string checker _is_bad_request_error to decide the status code. This works, but it's fragile — and the root cause is in the pipeline's error propagation, which raise a topic worth considering.

Did a bit investigation here:

How errors propagate today

engine_io.py raise ValueError("Prompt length 5000 exceeds thinker_max_seq_len 4096") ↓ worker/runtime.py except Exception as e: _send_failure(request_id, str(e)) # type info lost here ↓ CompleteMessage CompleteMessage(success=False, error="Prompt length 5000 exceeds ...") ↓ coordinator.py raise RuntimeError(msg.error) # rebuilt as RuntimeError ↓ openai_api.py if "exceeds thinker_max_seq_len" in str(exc) # string matching → 400

The worker serializes exceptions to plain strings (str(e)), and the coordinator rebuilds them all as RuntimeError. Original exception types are lost entirely.

Why it matters

Fragile: Changing the error message wording silently breaks the status code mapping.

Doesn't scale: Each new client-error type (400, 413, 429, etc.) needs another substring check. This will become spaghetti quickly.

Affects streaming too: stage/runtime.py:541,550 reconstructs errors with RuntimeError(msg.error) — same pattern.

Possible fix: add error_code to CompleteMessage

# proto/messages.py @dataclass class CompleteMessage: request_id: str from_stage: str success: bool result: Any = None error: str | None = None error_code: str | None = None # e.g. "PROMPT_TOO_LONG"

# engine_io.py — error source class PromptTooLongError(ValueError): error_code = "PROMPT_TOO_LONG"

# worker/runtime.py — preserve the code through serialization except Exception as e: error_code = getattr(e, "error_code", None) await self._send_failure(request_id, str(e), error_code=error_code)

# coordinator.py — attach code to the rebuilt exception exc = RuntimeError(msg.error or "Unknown error") exc.error_code = msg.error_code raise exc

# openai_api.py — classify by code, not by string except RuntimeError as exc: if getattr(exc, "error_code", None) == "PROMPT_TOO_LONG": raise HTTPException(status_code=400, detail=str(exc)) from exc raise HTTPException(status_code=500, detail=str(exc)) from exc

Backward-compatible: error_code defaults to None, existing errors are unaffected.

Summarize Open questions

Is the single string-match in PR [Fix] Add thinker input length check and parameter passing #330 acceptable for the time being?

Should we define a PipelineErrorCode enum up front, or just use free-form strings until we have more error types?

cc @zhaochenyang20

zhaochenyang20

int(thinker_max_seq_len) only casts the type; it does not reject zero or negatives. A user passing --thinker-max-seq-len 0 will see every request fail with "Prompt length N exceeds thinker_max_seq_len 0" — technically valid but unhelpful. Add a if thinker_max_seq_len <= 0: raise typer.BadParameter(...) guard

zhaochenyang20

As a wrapper around SGLang, sglang-omni's outward-facing contract (HTTP status codes, finish_reason, error semantics) must strictly match SGLang upstream — otherwise upper-layer clients switching between sglang and sglang-omni will hit silent divergences. I walked through SGLang source to establish the reference behavior for three scenarios:

Scenario	SGLang behavior	Source
(a) `prompt_len > context_length`	`tokenizer_manager._validate_one_request` raises `ValueError("The input (N tokens) is longer than the model's context length (M tokens).")` → `serving_base.py` catches `except ValueError` uniformly → HTTP 400	`managers/tokenizer_manager.py:758-771`, `entrypoints/openai/serving_base.py:127-132`
(b) `prompt_len + max_new_tokens > context_length`	Same `_validate_one_request` raises `ValueError("Requested token count exceeds the model's maximum context length...")` → HTTP 400	`managers/tokenizer_manager.py:773-798`
(c) Decode reaches `max_new_tokens` without EOS	`schedule_batch.py` sets `FINISH_LENGTH(length=max_new_tokens)`, commented `# to match OpenAI API's return value` → HTTP 200 + `finish_reason="length"`	`managers/schedule_batch.py:145-154, 1047-1051`

Scenario	This PR's behavior	Aligned with SGLang?
(a)	`_validate_prompt_seq_len` raises `ValueError` → worker stringifies → coordinator re-wraps as `RuntimeError` → `_is_bad_request_error` substring match → HTTP 400 (non-streaming only)	Semantically yes, but implementation uses substring match instead of exception class (SGLang uses `except ValueError` blanket mapping), and streaming does not get 400
(b)	Not handled at all. PR only validates the prompt side, not `prompt + max_new_tokens`. If a user's prompt is near `thinker_max_seq_len`, adding `max_new_tokens` may still hit SGLang's own KV-cache ceiling; behavior is unclear (SGLang may truncate again or fail)	Not aligned
(c)	Relies on SGLang's own `FINISH_LENGTH`; sglang-omni just needs to propagate. But must be tested	Needs a test proving `finish_reason="length"` propagates correctly from thinker all the way to HTTP response

Add (b) validation at the entry point: _validate_prompt_seq_len should also accept max_new_tokens and check prompt_len + max_new_tokens >= max_seq_len (note SGLang uses >= not > because it reserves headroom via num_reserved_tokens); align the error message wording with SGLang's "Requested token count exceeds...". This makes scenario (b) return a clean 400.
Test the (c) path: finish_reason="length" must travel from the thinker stage's ARRequestData.finish_reason / SGLang Req.finished_reason all the way to CompletionResult.finish_reason in the HTTP response. apply_thinker_result (engine_io.py:480) currently hardcodes "is_final": True without reading FINISH_LENGTH — confirm this chain does not drop SGLang's finish_reason.
Error propagation must use exception class, not substring: Ccyest's proposal is directionally right; long term, preserve exception class (which is what SGLang upstream does). Short term, at minimum consolidate _is_bad_request_error's marker in one place and assert on the phrase, or use isinstance(original_error, ValueError) as the signal (requires worker to preserve exception type).
Must add unit tests: all three scenarios must have unit tests that strictly assert HTTP status and finish_reason. These assertions should reference SGLang's corresponding case behavior as oracle — so that a behavior drift in this wrapper layer after a SGLang upgrade is caught immediately by CI.

zhaochenyang20 · 2026-04-22T00:37:40Z

 MIME_TO_FORMAT = {mime: fmt for fmt, mime in FORMAT_MIME_TYPES.items()}


+def _is_bad_request_error(exc: Exception) -> bool:


_is_bad_request_error decides 400-vs-500 via "exceeds thinker_max_seq_len" in str(exc). Fragility is obvious: reword the phrase in _validate_prompt_seq_len and 400 silently downgrades to 500. Once P0 above adds the "Requested token count exceeds..." error from case (b), the current substring will miss it — another branch to add.

Root cause is error serialization at the worker → coordinator boundary (exception class lost). The correct long-term fix is Ccyest's CompleteMessage.error_code proposal, which is framework-level refactor — not in this PR. Short-term: add a TODO, file a follow-up issue, and consolidate the matched phrases into constants so the P0 additions are synchronized.

Fix:

# sglang_omni/serve/openai_api.py _BAD_REQUEST_MARKERS = ( "longer than the", # case (a) — matches SGLang wording "Requested token count exceeds", # case (b) — matches SGLang wording ) def _is_bad_request_error(exc: Exception) -> bool: # TODO(#<new-follow-up-issue>): replace with structured error code. # Worker → coordinator currently serializes exceptions to str, so # 400 vs 500 must be recovered via phrase match. See Ccyest's proposal # on #330 for the end-to-end design (CompleteMessage.error_code). # These markers must stay in sync with SGLang's ValueError wording: # - managers/tokenizer_manager.py:761, 791 message = str(exc) return any(marker in message for marker in _BAD_REQUEST_MARKERS)

Jayon02 · 2026-04-22T12:03:47Z

Thanks to all reviewers for the detailed feedback!

At this point, I have only addressed the non-streaming path and added thinker input/output length handling. Since the project will likely go through a broader refactor soon, I have not tried to fully solve all of the remaining issues within this PR. Instead, I added follow-up notes in the codebase for the current gaps, including streaming error handling and a more structured error propagation path instead of the current string matching approach.

Ratish1

Thanks for addressing the review comments. It makes sense for now to not address the streaming issue. I just have one more comment, lmk what do you think about it.

Ratish1

LGTM

cc: @zhaochenyang20

…ejects, robust test prompt - cli/serve.py: route --thinker-max-seq-len through apply_server_args_overrides so the CLI shares the same path as Qwen3OmniPipelineConfig / Qwen3OmniSpeechPipelineConfig; also raise typer.BadParameter when the pipeline has no thinker stage, matching the --thinker-mem-fraction-static precedent. - engine_io.py: extract _DEFAULT_THINKER_MAX_NEW_TOKENS so the validator and the actual generation can never drift; log rejected requests with request_id before raising so operators can attribute 400s. - test_qwen3_omni_thinker_length: prompt uses "a " * 10000 (space-separated) so BPE merges cannot collapse the count below THINKER_MAX_SEQ_LEN. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…thinker_max_seq_len=32768 and encoder reserve=0.20

Runs the 50-sample videomme-ci-50 subset at concurrency=4 with the thinker-only server (--thinker-max-seq-len 32768 --encoder-mem-reserve 0.20) and asserts accuracy, zero failures, and per-concurrency speed thresholds derived from a 5-run H200 calibration with apply_slack (0.75/1.25). Accuracy floor is the worst-observed 0.54 with no slack; any PR losing correct answers below that floor on a cold run fails the test. The calibration window is wider than an earlier snapshot (clustered at {0.60, 0.60, 0.60, 0.60, 0.62}) because current main-line changes since that snapshot (PR sgl-project#318 mem_fraction defaults, sgl-project#319 talker micro-batching, sgl-project#330 thinker input-length check) altered internal scheduling determinism. Speed metrics improved in the same window; accuracy spread widened. The 0.54 floor reflects the worst observed cold-run accuracy at the time of this commit. The server fixture is module-scoped and passes both CLI flags so that the test is pinned to the configuration that produced the calibration, independent of future factory-default changes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Runs the 50-sample videomme-ci-50 subset at concurrency=4 with the thinker-only server (--thinker-max-seq-len 32768 --encoder-mem-reserve 0.20) and asserts accuracy, zero failures, and per-concurrency speed thresholds derived from a 5-run H200 calibration with apply_slack (0.75/1.25). Thresholds (vs pre-rebase-snapshot values, both from 5 back-to-back fresh-server pytest runs on the same H200): VIDEOMME_MIN_ACCURACY 0.60 -> 0.54 (widened floor) _VIDEOMME_P95.throughput 0.078 -> 0.084 (faster, tightens slack band) _VIDEOMME_P95.toks_agg 2.3 -> 2.5 (faster, tightens slack band) _VIDEOMME_P95.latency_s 50.3 -> 47.1 (faster, tightens slack band) The accuracy floor dropped because current main widens the cold-run accuracy spread from {0.60, 0.60, 0.60, 0.60, 0.62} to {0.62, 0.54, 0.58, 0.62, 0.58} (correct in {27, 29, 29, 31, 31} / 50, 0 failed every run). Main-line changes that landed between the two calibration windows (PR sgl-project#318 mem_fraction defaults, sgl-project#319 talker micro-batching, sgl-project#330 thinker input-length check) altered internal scheduling determinism; speed metrics improved in the same window while accuracy spread widened. 0.54 is the worst-observed cold-run accuracy and is enforced with no slack — any PR that loses a correct answer below that floor fails the test. The server fixture is module-scoped and passes both CLI flags so that the test is pinned to the configuration that produced the calibration, independent of future factory-default changes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Jayon02 requested review from FrankLeeeee and shuaills as code owners April 21, 2026 17:23

Ratish1 reviewed Apr 21, 2026

View reviewed changes

Comment thread sglang_omni/models/qwen3_omni/config.py Outdated

Comment thread sglang_omni/serve/openai_api.py

Ccyest reviewed Apr 21, 2026

View reviewed changes

[Fix] Add thinker length guards and parameter plumbing

2d501f1

zhaochenyang20 force-pushed the thinker_length_fixes branch from 95ce3e7 to 032d1cc Compare April 21, 2026 23:37

zhaochenyang20 requested changes Apr 22, 2026

View reviewed changes

add exception handler

abb6061

zhaochenyang20 force-pushed the thinker_length_fixes branch from 032d1cc to abb6061 Compare April 22, 2026 00:31

zhaochenyang20 requested changes Apr 22, 2026

View reviewed changes

Jayon02 added 5 commits April 22, 2026 12:59

align scenario b

895c1c0

add follow-up TODO

d0f23ff

update

09b1af6

format

258ba1f

update

04f92ff

Jayon02 requested a review from zhaochenyang20 April 22, 2026 12:04

Ratish1 reviewed Apr 22, 2026

View reviewed changes

Comment thread sglang_omni/models/qwen3_omni/pipeline/engine_io.py Outdated

align thinker max_new_tokens validation with execution

8feb514

Ratish1 approved these changes Apr 22, 2026

View reviewed changes

zhaochenyang20 requested changes Apr 22, 2026

View reviewed changes

zhaochenyang20 and others added 3 commits April 22, 2026 18:23

update ci

d5bebda

update ci for S2

ff78566

zhaochenyang20 approved these changes Apr 22, 2026

View reviewed changes

zhaochenyang20 merged commit 06afb6c into sgl-project:main Apr 22, 2026
21 of 23 checks passed

zhaochenyang20 added a commit to Jayon02/sglang-omni that referenced this pull request Apr 23, 2026

Merge issue-253 (fix sgl-project#330+ defaults for videomme): brings …

76cf1a6

…thinker_max_seq_len=32768 and encoder reserve=0.20

zhaochenyang20 mentioned this pull request Apr 24, 2026

[Feat] Expose encoder mem reserve as --encoder-mem-reserve CLI flag #339

Merged

		MIME_TO_FORMAT = {mime: fmt for fmt, mime in FORMAT_MIME_TYPES.items()}


		def _is_bad_request_error(exc: Exception) -> bool:

Conversation

Jayon02 commented Apr 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Motivation

Modifications

Related Issues

Accuracy Test

Functionality Test

Benchmark & Profiling

Checklist

Uh oh!

Ratish1 left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Ccyest Apr 21, 2026

Choose a reason for hiding this comment

How errors propagate today

Why it matters

Possible fix: add error_code to CompleteMessage

Summarize Open questions

Uh oh!

zhaochenyang20 left a comment

Choose a reason for hiding this comment

Uh oh!

zhaochenyang20 left a comment

Choose a reason for hiding this comment

Uh oh!

zhaochenyang20 Apr 22, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Jayon02 commented Apr 22, 2026

Uh oh!

Ratish1 left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Ratish1 left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Jayon02 commented Apr 21, 2026 •

edited

Loading

Possible fix: add `error_code` to `CompleteMessage`