[Fix] Add thinker input length check and parameter passing#330
Conversation
Ratish1
left a comment
There was a problem hiding this comment.
Overall Looks good. Just a few comments below.
| except RuntimeError as exc: | ||
| if _is_bad_request_error(exc): | ||
| raise HTTPException(status_code=400, detail=str(exc)) from exc | ||
| raise HTTPException(status_code=500, detail=str(exc)) from exc |
There was a problem hiding this comment.
Found while reviewing. The PR adds a prompt-length guard for the thinker stage and needs to return HTTP 400 (not 500) when a request exceeds thinker_max_seq_len.
The current implementation applies a string checker _is_bad_request_error to decide the status code. This works, but it's fragile — and the root cause is in the pipeline's error propagation, which raise a topic worth considering.
Did a bit investigation here:
How errors propagate today
engine_io.py raise ValueError("Prompt length 5000 exceeds thinker_max_seq_len 4096")
↓
worker/runtime.py except Exception as e:
_send_failure(request_id, str(e)) # type info lost here
↓
CompleteMessage CompleteMessage(success=False, error="Prompt length 5000 exceeds ...")
↓
coordinator.py raise RuntimeError(msg.error) # rebuilt as RuntimeError
↓
openai_api.py if "exceeds thinker_max_seq_len" in str(exc) # string matching
→ 400
The worker serializes exceptions to plain strings (str(e)), and the coordinator rebuilds them all as RuntimeError. Original exception types are lost entirely.
Why it matters
- Fragile: Changing the error message wording silently breaks the status code mapping.
- Doesn't scale: Each new client-error type (400, 413, 429, etc.) needs another substring check. This will become spaghetti quickly.
- Affects streaming too:
stage/runtime.py:541,550reconstructs errors withRuntimeError(msg.error)— same pattern.
Possible fix: add error_code to CompleteMessage
# proto/messages.py
@dataclass
class CompleteMessage:
request_id: str
from_stage: str
success: bool
result: Any = None
error: str | None = None
error_code: str | None = None # e.g. "PROMPT_TOO_LONG"# engine_io.py — error source
class PromptTooLongError(ValueError):
error_code = "PROMPT_TOO_LONG"# worker/runtime.py — preserve the code through serialization
except Exception as e:
error_code = getattr(e, "error_code", None)
await self._send_failure(request_id, str(e), error_code=error_code)# coordinator.py — attach code to the rebuilt exception
exc = RuntimeError(msg.error or "Unknown error")
exc.error_code = msg.error_code
raise exc# openai_api.py — classify by code, not by string
except RuntimeError as exc:
if getattr(exc, "error_code", None) == "PROMPT_TOO_LONG":
raise HTTPException(status_code=400, detail=str(exc)) from exc
raise HTTPException(status_code=500, detail=str(exc)) from excBackward-compatible: error_code defaults to None, existing errors are unaffected.
Summarize Open questions
- Is the single string-match in PR [Fix] Add thinker input length check and parameter passing #330 acceptable for the time being?
- Should we define a
PipelineErrorCodeenum up front, or just use free-form strings until we have more error types?
95ce3e7 to
032d1cc
Compare
zhaochenyang20
left a comment
There was a problem hiding this comment.
int(thinker_max_seq_len) only casts the type; it does not reject zero or negatives. A user passing --thinker-max-seq-len 0 will see every request fail with "Prompt length N exceeds thinker_max_seq_len 0" — technically valid but unhelpful. Add a if thinker_max_seq_len <= 0: raise typer.BadParameter(...) guard
032d1cc to
abb6061
Compare
zhaochenyang20
left a comment
There was a problem hiding this comment.
As a wrapper around SGLang, sglang-omni's outward-facing contract (HTTP status codes, finish_reason, error semantics) must strictly match SGLang upstream — otherwise upper-layer clients switching between sglang and sglang-omni will hit silent divergences. I walked through SGLang source to establish the reference behavior for three scenarios:
| Scenario | SGLang behavior | Source |
|---|---|---|
(a) prompt_len > context_length |
tokenizer_manager._validate_one_request raises ValueError("The input (N tokens) is longer than the model's context length (M tokens).") → serving_base.py catches except ValueError uniformly → HTTP 400 |
managers/tokenizer_manager.py:758-771, entrypoints/openai/serving_base.py:127-132 |
(b) prompt_len + max_new_tokens > context_length |
Same _validate_one_request raises ValueError("Requested token count exceeds the model's maximum context length...") → HTTP 400 |
managers/tokenizer_manager.py:773-798 |
(c) Decode reaches max_new_tokens without EOS |
schedule_batch.py sets FINISH_LENGTH(length=max_new_tokens), commented # to match OpenAI API's return value → HTTP 200 + finish_reason="length" |
managers/schedule_batch.py:145-154, 1047-1051 |
| Scenario | This PR's behavior | Aligned with SGLang? |
|---|---|---|
| (a) | _validate_prompt_seq_len raises ValueError → worker stringifies → coordinator re-wraps as RuntimeError → _is_bad_request_error substring match → HTTP 400 (non-streaming only) |
Semantically yes, but implementation uses substring match instead of exception class (SGLang uses except ValueError blanket mapping), and streaming does not get 400 |
| (b) | Not handled at all. PR only validates the prompt side, not prompt + max_new_tokens. If a user's prompt is near thinker_max_seq_len, adding max_new_tokens may still hit SGLang's own KV-cache ceiling; behavior is unclear (SGLang may truncate again or fail) |
Not aligned |
| (c) | Relies on SGLang's own FINISH_LENGTH; sglang-omni just needs to propagate. But must be tested |
Needs a test proving finish_reason="length" propagates correctly from thinker all the way to HTTP response |
-
Add (b) validation at the entry point:
_validate_prompt_seq_lenshould also acceptmax_new_tokensand checkprompt_len + max_new_tokens >= max_seq_len(note SGLang uses>=not>because it reserves headroom vianum_reserved_tokens); align the error message wording with SGLang's "Requested token count exceeds...". This makes scenario (b) return a clean 400. -
Test the (c) path:
finish_reason="length"must travel from the thinker stage'sARRequestData.finish_reason/ SGLangReq.finished_reasonall the way toCompletionResult.finish_reasonin the HTTP response.apply_thinker_result(engine_io.py:480) currently hardcodes"is_final": Truewithout readingFINISH_LENGTH— confirm this chain does not drop SGLang'sfinish_reason. -
Error propagation must use exception class, not substring: Ccyest's proposal is directionally right; long term, preserve exception class (which is what SGLang upstream does). Short term, at minimum consolidate
_is_bad_request_error's marker in one place and assert on the phrase, or useisinstance(original_error, ValueError)as the signal (requires worker to preserve exception type). -
Must add unit tests: all three scenarios must have unit tests that strictly assert HTTP status and finish_reason. These assertions should reference SGLang's corresponding case behavior as oracle — so that a behavior drift in this wrapper layer after a SGLang upgrade is caught immediately by CI.
| MIME_TO_FORMAT = {mime: fmt for fmt, mime in FORMAT_MIME_TYPES.items()} | ||
|
|
||
|
|
||
| def _is_bad_request_error(exc: Exception) -> bool: |
There was a problem hiding this comment.
_is_bad_request_error decides 400-vs-500 via "exceeds thinker_max_seq_len" in str(exc). Fragility is obvious: reword the phrase in _validate_prompt_seq_len and 400 silently downgrades to 500. Once P0 above adds the "Requested token count exceeds..." error from case (b), the current substring will miss it — another branch to add.
Root cause is error serialization at the worker → coordinator boundary (exception class lost). The correct long-term fix is Ccyest's CompleteMessage.error_code proposal, which is framework-level refactor — not in this PR. Short-term: add a TODO, file a follow-up issue, and consolidate the matched phrases into constants so the P0 additions are synchronized.
Fix:
# sglang_omni/serve/openai_api.py
_BAD_REQUEST_MARKERS = (
"longer than the", # case (a) — matches SGLang wording
"Requested token count exceeds", # case (b) — matches SGLang wording
)
def _is_bad_request_error(exc: Exception) -> bool:
# TODO(#<new-follow-up-issue>): replace with structured error code.
# Worker → coordinator currently serializes exceptions to str, so
# 400 vs 500 must be recovered via phrase match. See Ccyest's proposal
# on #330 for the end-to-end design (CompleteMessage.error_code).
# These markers must stay in sync with SGLang's ValueError wording:
# - managers/tokenizer_manager.py:761, 791
message = str(exc)
return any(marker in message for marker in _BAD_REQUEST_MARKERS)|
Thanks to all reviewers for the detailed feedback! At this point, I have only addressed the non-streaming path and added thinker input/output length handling. Since the project will likely go through a broader refactor soon, I have not tried to fully solve all of the remaining issues within this PR. Instead, I added follow-up notes in the codebase for the current gaps, including streaming error handling and a more structured error propagation path instead of the current string matching approach. |
Ratish1
left a comment
There was a problem hiding this comment.
Thanks for addressing the review comments. It makes sense for now to not address the streaming issue. I just have one more comment, lmk what do you think about it.
Ratish1
left a comment
There was a problem hiding this comment.
LGTM
cc: @zhaochenyang20
…ejects, robust test prompt - cli/serve.py: route --thinker-max-seq-len through apply_server_args_overrides so the CLI shares the same path as Qwen3OmniPipelineConfig / Qwen3OmniSpeechPipelineConfig; also raise typer.BadParameter when the pipeline has no thinker stage, matching the --thinker-mem-fraction-static precedent. - engine_io.py: extract _DEFAULT_THINKER_MAX_NEW_TOKENS so the validator and the actual generation can never drift; log rejected requests with request_id before raising so operators can attribute 400s. - test_qwen3_omni_thinker_length: prompt uses "a " * 10000 (space-separated) so BPE merges cannot collapse the count below THINKER_MAX_SEQ_LEN. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…thinker_max_seq_len=32768 and encoder reserve=0.20
Runs the 50-sample videomme-ci-50 subset at concurrency=4 with the
thinker-only server (--thinker-max-seq-len 32768 --encoder-mem-reserve
0.20) and asserts accuracy, zero failures, and per-concurrency speed
thresholds derived from a 5-run H200 calibration with apply_slack
(0.75/1.25). Accuracy floor is the worst-observed 0.54 with no slack;
any PR losing correct answers below that floor on a cold run fails
the test.
The calibration window is wider than an earlier snapshot (clustered
at {0.60, 0.60, 0.60, 0.60, 0.62}) because current main-line changes
since that snapshot (PR sgl-project#318 mem_fraction defaults, sgl-project#319 talker
micro-batching, sgl-project#330 thinker input-length check) altered internal
scheduling determinism. Speed metrics improved in the same window;
accuracy spread widened. The 0.54 floor reflects the worst observed
cold-run accuracy at the time of this commit.
The server fixture is module-scoped and passes both CLI flags so that
the test is pinned to the configuration that produced the calibration,
independent of future factory-default changes.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Runs the 50-sample videomme-ci-50 subset at concurrency=4 with the
thinker-only server (--thinker-max-seq-len 32768 --encoder-mem-reserve
0.20) and asserts accuracy, zero failures, and per-concurrency speed
thresholds derived from a 5-run H200 calibration with apply_slack
(0.75/1.25).
Thresholds (vs pre-rebase-snapshot values, both from 5 back-to-back
fresh-server pytest runs on the same H200):
VIDEOMME_MIN_ACCURACY 0.60 -> 0.54 (widened floor)
_VIDEOMME_P95.throughput 0.078 -> 0.084 (faster, tightens slack band)
_VIDEOMME_P95.toks_agg 2.3 -> 2.5 (faster, tightens slack band)
_VIDEOMME_P95.latency_s 50.3 -> 47.1 (faster, tightens slack band)
The accuracy floor dropped because current main widens the cold-run
accuracy spread from {0.60, 0.60, 0.60, 0.60, 0.62} to
{0.62, 0.54, 0.58, 0.62, 0.58} (correct in {27, 29, 29, 31, 31} / 50,
0 failed every run). Main-line changes that landed between the two
calibration windows (PR sgl-project#318 mem_fraction defaults, sgl-project#319 talker
micro-batching, sgl-project#330 thinker input-length check) altered internal
scheduling determinism; speed metrics improved in the same window
while accuracy spread widened. 0.54 is the worst-observed cold-run
accuracy and is enforced with no slack — any PR that loses a correct
answer below that floor fails the test.
The server fixture is module-scoped and passes both CLI flags so that
the test is pinned to the configuration that produced the calibration,
independent of future factory-default changes.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Motivation
Modifications
thinker_max_seq_lenin server_args_overrides and propagate it into the thinker stage config.thinker_max_seq_len.thinker_max_seq_leninto both thinker request paths so the new length guard uses the configured limit consistently.video_fpsfor video requests in the Qwen3-Omni preprocessor.thinker_max_seq_lenthrough the existing override path instead of patching the thinker stage config manually after construction.--thinker-max-seq-lenon the unified server CLI and apply it to the thinker stage config during server startup.overridepath and direct CLI parameter passing.thinker_max_seq_len, returning 400 Bad Request.Related Issues
#327 phase 1: Bug fix.
Accuracy Test
Functionality Test
Benchmark & Profiling
Checklist