
[Fix] Add thinker input length check and parameter passing #330

Merged
zhaochenyang20 merged 11 commits into sgl-project:main from Jayon02:thinker_length_fixes
Apr 22, 2026

Conversation

@Jayon02

@Jayon02 Jayon02 commented Apr 21, 2026

Collaborator

Motivation

Modifications

  • sglang_omni/models/qwen3_omni/config.py: support thinker_max_seq_len in server_args_overrides and propagate it into the thinker stage config.
  • sglang_omni/models/qwen3_omni/pipeline/engine_io.py: add an explicit prompt-length guard before thinker execution and raise a clear error when requests exceed thinker_max_seq_len.
  • sglang_omni/models/qwen3_omni/pipeline/stages.py: wire thinker_max_seq_len into both thinker request paths so the new length guard uses the configured limit consistently.
  • sglang_omni/models/qwen3_omni/components/preprocessor.py: set a default video_fps for video requests in the Qwen3-Omni preprocessor.
  • examples/run_qwen3_omni_server.py and examples/run_qwen3_omni_text_first.py: pass thinker_max_seq_len through the existing override path instead of patching the thinker stage config manually after construction.
  • sglang_omni/cli/serve.py: expose --thinker-max-seq-len on the unified server CLI and apply it to the thinker stage config during server startup.
  • Preserve both the override path and direct CLI parameter passing.
  • Add exception handling for requests that exceed thinker_max_seq_len, returning 400 Bad Request.

Related Issues

#327 phase 1: Bug fix.

Accuracy Test

Functionality Test

python -m sglang_omni.cli.cli serve \
  --model-path Qwen/Qwen3-Omni-30B-A3B-Instruct \
  --text-only \
  --thinker-max-seq-len 128 \
  --port 8000

curl http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-Omni-30B-A3B-Instruct",
    "messages": [{
      "role": "user",
      "content": "Please repeat the following text exactly and then summarize it: a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a"
    }],
    "max_tokens": 16,
    "stream": false
  }'

Benchmark & Profiling

Checklist

  • Format your code according to pre-commit.
  • Add unit tests.
  • Update documentation / docstrings / example tutorials as needed.
  • Provide throughput / latency benchmark results and accuracy evaluation results as needed.
  • For reviewers: If you haven't made any contributions to this PR and are only assisting with merging the main branch, please remove yourself as a co-author when merging the PR.

@Ratish1 Ratish1 left a comment

Collaborator


Overall looks good. Just a few comments below.

Comment thread sglang_omni/models/qwen3_omni/config.py Outdated
Comment thread sglang_omni/serve/openai_api.py
Comment on lines +198 to +201
    except RuntimeError as exc:
        if _is_bad_request_error(exc):
            raise HTTPException(status_code=400, detail=str(exc)) from exc
        raise HTTPException(status_code=500, detail=str(exc)) from exc

Collaborator


Found while reviewing. The PR adds a prompt-length guard for the thinker stage and needs to return HTTP 400 (not 500) when a request exceeds thinker_max_seq_len.

The current implementation applies a string checker, _is_bad_request_error, to decide the status code. This works, but it is fragile, and the root cause is in the pipeline's error propagation, which raises a topic worth considering.

I did a bit of investigation here:

How errors propagate today

engine_io.py        raise ValueError("Prompt length 5000 exceeds thinker_max_seq_len 4096")
                          ↓
worker/runtime.py   except Exception as e:
                        _send_failure(request_id, str(e))        # type info lost here
                          ↓
CompleteMessage     CompleteMessage(success=False, error="Prompt length 5000 exceeds ...")
                          ↓
coordinator.py      raise RuntimeError(msg.error)                # rebuilt as RuntimeError
                          ↓
openai_api.py       if "exceeds thinker_max_seq_len" in str(exc) # string matching
                        → 400

The worker serializes exceptions to plain strings (str(e)), and the coordinator rebuilds them all as RuntimeError. Original exception types are lost entirely.
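
The type loss described above can be reproduced in isolation. The following is a minimal stand-in rather than the project's code: `worker_side` and `coordinator_side` are hypothetical names that only mimic the stringify-then-rebuild steps from the diagram.

```python
def worker_side(fn):
    """Run fn and, like the worker, keep only str(e) on failure."""
    try:
        fn()
    except Exception as e:
        return str(e)  # the exception class is discarded here
    return None


def coordinator_side(error):
    """Rebuild every failure as a RuntimeError, as the coordinator does."""
    return RuntimeError(error)


def too_long():
    raise ValueError("Prompt length 5000 exceeds thinker_max_seq_len 4096")


rebuilt = coordinator_side(worker_side(too_long))
print(type(rebuilt).__name__)           # RuntimeError
print(isinstance(rebuilt, ValueError))  # False
```

Only the message text survives the round trip, which is why the API layer currently has nothing but the string to inspect.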

Why it matters

  • Fragile: Changing the error message wording silently breaks the status code mapping.
  • Doesn't scale: Each new client-error type (400, 413, 429, etc.) needs another substring check. This will become spaghetti quickly.
  • Affects streaming too: stage/runtime.py:541,550 reconstructs errors with RuntimeError(msg.error) — same pattern.

Possible fix: add error_code to CompleteMessage

# proto/messages.py
from dataclasses import dataclass
from typing import Any

@dataclass
class CompleteMessage:
    request_id: str
    from_stage: str
    success: bool
    result: Any = None
    error: str | None = None
    error_code: str | None = None   # e.g. "PROMPT_TOO_LONG"

# engine_io.py: error source
class PromptTooLongError(ValueError):
    error_code = "PROMPT_TOO_LONG"

# worker/runtime.py: preserve the code through serialization
except Exception as e:
    error_code = getattr(e, "error_code", None)
    await self._send_failure(request_id, str(e), error_code=error_code)

# coordinator.py: attach code to the rebuilt exception
exc = RuntimeError(msg.error or "Unknown error")
exc.error_code = msg.error_code
raise exc

# openai_api.py: classify by code, not by string
except RuntimeError as exc:
    if getattr(exc, "error_code", None) == "PROMPT_TOO_LONG":
        raise HTTPException(status_code=400, detail=str(exc)) from exc
    raise HTTPException(status_code=500, detail=str(exc)) from exc

Backward-compatible: error_code defaults to None, existing errors are unaffected.
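
To see the proposal end to end, the fragments can be stitched into one self-contained simulation. This is a sketch under assumptions: `run_stage`, `raise_if_failed`, and `http_status` are hypothetical helpers standing in for the worker, coordinator, and API layers, and the status code is returned as a plain integer instead of raising HTTPException.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class CompleteMessage:
    request_id: str
    from_stage: str
    success: bool
    result: Any = None
    error: Optional[str] = None
    error_code: Optional[str] = None


class PromptTooLongError(ValueError):
    error_code = "PROMPT_TOO_LONG"


def run_stage(request_id, fn):
    """Worker side (hypothetical): serialize a failure but keep its code."""
    try:
        return CompleteMessage(request_id, "thinker", True, result=fn())
    except Exception as e:
        return CompleteMessage(
            request_id, "thinker", False,
            error=str(e), error_code=getattr(e, "error_code", None),
        )


def raise_if_failed(msg):
    """Coordinator side (hypothetical): rebuild and re-attach the code."""
    if msg.success:
        return msg.result
    exc = RuntimeError(msg.error or "Unknown error")
    exc.error_code = msg.error_code
    raise exc


def http_status(fn):
    """API side (hypothetical): classify by code, not by message text."""
    try:
        raise_if_failed(run_stage("req-1", fn))
    except RuntimeError as exc:
        if getattr(exc, "error_code", None) == "PROMPT_TOO_LONG":
            return 400
        return 500
    return 200


def too_long():
    raise PromptTooLongError("Prompt length 5000 exceeds thinker_max_seq_len 4096")


def crash():
    raise KeyError("missing")


def ok():
    return "ok"


print(http_status(too_long), http_status(crash), http_status(ok))  # 400 500 200
```

Rewording the PromptTooLongError message leaves the 400 mapping intact, which is the property the substring check lacks.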

Open questions

  1. Is the single string-match in PR #330 acceptable for the time being?
  2. Should we define a PipelineErrorCode enum up front, or just use free-form strings until we have more error types?
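
For the second question, an enum-based variant might look like the sketch below. Everything beyond the PipelineErrorCode name and the PROMPT_TOO_LONG value is an assumption made for illustration, including the `status_for` helper and the fallback to 500 for unknown codes.

```python
from enum import Enum
from http import HTTPStatus
from typing import Optional


class PipelineErrorCode(str, Enum):
    # The str mixin keeps members serializable as plain strings.
    PROMPT_TOO_LONG = "PROMPT_TOO_LONG"


# Hypothetical mapping table; one entry per client-error type.
_STATUS_BY_CODE = {
    PipelineErrorCode.PROMPT_TOO_LONG: HTTPStatus.BAD_REQUEST,
}


def status_for(error_code: Optional[str]) -> int:
    """Map a serialized error code to an HTTP status, defaulting to 500."""
    try:
        code = PipelineErrorCode(error_code)
    except ValueError:
        # None or an unrecognized string: treat as a server error.
        return int(HTTPStatus.INTERNAL_SERVER_ERROR)
    return int(_STATUS_BY_CODE.get(code, HTTPStatus.INTERNAL_SERVER_ERROR))


print(status_for("PROMPT_TOO_LONG"))  # 400
print(status_for(None))               # 500
```

The trade-off is the usual one: an enum catches typos at the call site, while free-form strings need no shared definition until more error types exist.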

cc @zhaochenyang20

@zhaochenyang20 zhaochenyang20 left a comment

Collaborator


int(thinker_max_seq_len) only casts the type; it does not reject zero or negatives. A user passing --thinker-max-seq-len 0 will see every request fail with "Prompt length N exceeds thinker_max_seq_len 0", which is technically valid but unhelpful. Add an if thinker_max_seq_len <= 0: raise typer.BadParameter(...) guard.
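
The shape of that guard can be shown with the standard library. Note the swap: the sketch below uses argparse and argparse.ArgumentTypeError as a stand-in so it runs without extra dependencies, whereas the actual CLI uses typer and would raise typer.BadParameter. The `positive_int` and `build_parser` names are hypothetical.

```python
import argparse


def positive_int(raw: str) -> int:
    """Parse raw as an integer and reject zero or negative values."""
    try:
        value = int(raw)
    except ValueError:
        raise argparse.ArgumentTypeError(f"expected an integer, got {raw!r}")
    if value <= 0:
        raise argparse.ArgumentTypeError(
            f"must be a positive integer, got {value}"
        )
    return value


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="serve-sketch")
    parser.add_argument(
        "--thinker-max-seq-len", type=positive_int, default=None
    )
    return parser


args = build_parser().parse_args(["--thinker-max-seq-len", "128"])
print(args.thinker_max_seq_len)  # 128
```

Rejecting the value at parse time means the user sees the problem at startup instead of on every request.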

@zhaochenyang20 zhaochenyang20 left a comment

Collaborator


As a wrapper around SGLang, sglang-omni's outward-facing contract (HTTP status codes, finish_reason, error semantics) must strictly match SGLang upstream — otherwise upper-layer clients switching between sglang and sglang-omni will hit silent divergences. I walked through SGLang source to establish the reference behavior for three scenarios:

| Scenario | SGLang behavior | Source |
| --- | --- | --- |
| (a) prompt_len > context_length | tokenizer_manager._validate_one_request raises ValueError("The input (N tokens) is longer than the model's context length (M tokens).") → serving_base.py catches except ValueError uniformly → HTTP 400 | managers/tokenizer_manager.py:758-771, entrypoints/openai/serving_base.py:127-132 |
| (b) prompt_len + max_new_tokens > context_length | Same _validate_one_request raises ValueError("Requested token count exceeds the model's maximum context length...") → HTTP 400 | managers/tokenizer_manager.py:773-798 |
| (c) Decode reaches max_new_tokens without EOS | schedule_batch.py sets FINISH_LENGTH(length=max_new_tokens), commented # to match OpenAI API's return value → HTTP 200 + finish_reason="length" | managers/schedule_batch.py:145-154, 1047-1051 |

| Scenario | This PR's behavior | Aligned with SGLang? |
| --- | --- | --- |
| (a) | _validate_prompt_seq_len raises ValueError → worker stringifies → coordinator re-wraps as RuntimeError → _is_bad_request_error substring match → HTTP 400 (non-streaming only) | Semantically yes, but implementation uses substring match instead of exception class (SGLang uses except ValueError blanket mapping), and streaming does not get 400 |
| (b) | Not handled at all. PR only validates the prompt side, not prompt + max_new_tokens. If a user's prompt is near thinker_max_seq_len, adding max_new_tokens may still hit SGLang's own KV-cache ceiling; behavior is unclear (SGLang may truncate again or fail) | Not aligned |
| (c) | Relies on SGLang's own FINISH_LENGTH; sglang-omni just needs to propagate. But must be tested | Needs a test proving finish_reason="length" propagates correctly from thinker all the way to HTTP response |

  1. Add (b) validation at the entry point: _validate_prompt_seq_len should also accept max_new_tokens and check prompt_len + max_new_tokens >= max_seq_len (note SGLang uses >= not > because it reserves headroom via num_reserved_tokens); align the error message wording with SGLang's "Requested token count exceeds...". This makes scenario (b) return a clean 400.

  2. Test the (c) path: finish_reason="length" must travel from the thinker stage's ARRequestData.finish_reason / SGLang Req.finished_reason all the way to CompletionResult.finish_reason in the HTTP response. apply_thinker_result (engine_io.py:480) currently hardcodes "is_final": True without reading FINISH_LENGTH — confirm this chain does not drop SGLang's finish_reason.

  3. Error propagation must use exception class, not substring: Ccyest's proposal is directionally right; long term, preserve exception class (which is what SGLang upstream does). Short term, at minimum consolidate _is_bad_request_error's marker in one place and assert on the phrase, or use isinstance(original_error, ValueError) as the signal (requires worker to preserve exception type).

  4. Must add unit tests: all three scenarios must have unit tests that strictly assert HTTP status and finish_reason. These assertions should reference SGLang's corresponding case behavior as oracle — so that a behavior drift in this wrapper layer after a SGLang upgrade is caught immediately by CI.
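
Items 1 and 4 can be sketched together for scenarios (a) and (b). This is an illustration under assumptions, not the PR's implementation: `validate_thinker_lengths` and `status_for_request` are hypothetical names, the comparison operators follow the review text above, and the tail of the second message is invented because the SGLang wording is elided in the table. Scenario (c) needs the live pipeline and is not covered.

```python
def validate_thinker_lengths(
    prompt_len: int, max_new_tokens: int, max_seq_len: int
) -> None:
    """Raise ValueError if the request cannot fit in max_seq_len."""
    # Scenario (a): the prompt alone is too long.
    if prompt_len > max_seq_len:
        raise ValueError(
            f"The input ({prompt_len} tokens) is longer than the model's "
            f"context length ({max_seq_len} tokens)."
        )
    # Scenario (b): prompt plus requested output does not fit.
    # The review suggests >= here, mirroring SGLang's reserved headroom.
    if prompt_len + max_new_tokens >= max_seq_len:
        raise ValueError(
            "Requested token count exceeds the model's maximum context "
            # Everything after this point is illustrative wording only.
            f"length: {prompt_len} + {max_new_tokens} vs {max_seq_len}."
        )


def status_for_request(
    prompt_len: int, max_new_tokens: int, max_seq_len: int
) -> int:
    """Blanket ValueError -> 400 mapping, as the table describes for SGLang."""
    try:
        validate_thinker_lengths(prompt_len, max_new_tokens, max_seq_len)
    except ValueError:
        return 400
    return 200


print(status_for_request(5000, 16, 4096))  # 400, scenario (a)
print(status_for_request(4090, 16, 4096))  # 400, scenario (b)
print(status_for_request(100, 16, 4096))   # 200
```

Unit tests for the wrapper could assert these same status codes, with the boundary case prompt_len + max_new_tokens == max_seq_len pinned explicitly so an operator change is caught.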

MIME_TO_FORMAT = {mime: fmt for fmt, mime in FORMAT_MIME_TYPES.items()}


def _is_bad_request_error(exc: Exception) -> bool:

Collaborator


_is_bad_request_error decides 400-vs-500 via "exceeds thinker_max_seq_len" in str(exc). Fragility is obvious: reword the phrase in _validate_prompt_seq_len and 400 silently downgrades to 500. Once P0 above adds the "Requested token count exceeds..." error from case (b), the current substring will miss it — another branch to add.

The root cause is error serialization at the worker → coordinator boundary (the exception class is lost). The correct long-term fix is Ccyest's CompleteMessage.error_code proposal, which is a framework-level refactor and out of scope for this PR. Short-term: add a TODO, file a follow-up issue, and consolidate the matched phrases into constants so the P0 additions are synchronized.

Fix:

# sglang_omni/serve/openai_api.py
_BAD_REQUEST_MARKERS = (
    "longer than the",           # case (a) — matches SGLang wording
    "Requested token count exceeds",  # case (b) — matches SGLang wording
)

def _is_bad_request_error(exc: Exception) -> bool:
    # TODO(#<new-follow-up-issue>): replace with structured error code.
    # Worker → coordinator currently serializes exceptions to str, so
    # 400 vs 500 must be recovered via phrase match. See Ccyest's proposal
    # on #330 for the end-to-end design (CompleteMessage.error_code).
    # These markers must stay in sync with SGLang's ValueError wording:
    #   - managers/tokenizer_manager.py:761, 791
    message = str(exc)
    return any(marker in message for marker in _BAD_REQUEST_MARKERS)

Comment thread sglang_omni/serve/openai_api.py
@Jayon02

Jayon02 commented Apr 22, 2026

Collaborator Author

Thanks to all reviewers for the detailed feedback!

At this point, I have only addressed the non-streaming path and added thinker input/output length handling. Since the project will likely go through a broader refactor soon, I have not tried to fully solve all of the remaining issues within this PR. Instead, I added follow-up notes in the codebase for the current gaps, including streaming error handling and a more structured error propagation path instead of the current string matching approach.

@Jayon02 Jayon02 requested a review from zhaochenyang20 April 22, 2026 12:04

@Ratish1 Ratish1 left a comment

Collaborator


Thanks for addressing the review comments. It makes sense not to address the streaming issue for now. I just have one more comment; let me know what you think about it.

Comment thread sglang_omni/models/qwen3_omni/pipeline/engine_io.py Outdated

@Ratish1 Ratish1 left a comment

Collaborator


LGTM

cc: @zhaochenyang20

Comment thread sglang_omni/cli/serve.py Outdated
Comment thread sglang_omni/models/qwen3_omni/pipeline/engine_io.py Outdated
Comment thread sglang_omni/models/qwen3_omni/pipeline/engine_io.py
Comment thread tests/test_model/test_qwen3_omni_thinker_length.py Outdated
Comment thread tests/test_model/test_qwen3_omni_thinker_length.py
zhaochenyang20 and others added 3 commits April 22, 2026 18:23
…ejects, robust test prompt

- cli/serve.py: route --thinker-max-seq-len through apply_server_args_overrides
  so the CLI shares the same path as Qwen3OmniPipelineConfig / Qwen3OmniSpeechPipelineConfig;
  also raise typer.BadParameter when the pipeline has no thinker stage, matching
  the --thinker-mem-fraction-static precedent.
- engine_io.py: extract _DEFAULT_THINKER_MAX_NEW_TOKENS so the validator and the
  actual generation can never drift; log rejected requests with request_id before
  raising so operators can attribute 400s.
- test_qwen3_omni_thinker_length: prompt uses "a " * 10000 (space-separated) so
  BPE merges cannot collapse the count below THINKER_MAX_SEQ_LEN.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@zhaochenyang20 zhaochenyang20 merged commit 06afb6c into sgl-project:main Apr 22, 2026
21 of 23 checks passed
zhaochenyang20 added a commit to Jayon02/sglang-omni that referenced this pull request Apr 23, 2026
…thinker_max_seq_len=32768 and encoder reserve=0.20
zhaochenyang20 added a commit to Jayon02/sglang-omni that referenced this pull request Apr 24, 2026
Runs the 50-sample videomme-ci-50 subset at concurrency=4 with the
thinker-only server (--thinker-max-seq-len 32768 --encoder-mem-reserve
0.20) and asserts accuracy, zero failures, and per-concurrency speed
thresholds derived from a 5-run H200 calibration with apply_slack
(0.75/1.25). Accuracy floor is the worst-observed 0.54 with no slack;
any PR losing correct answers below that floor on a cold run fails
the test.

The calibration window is wider than an earlier snapshot (clustered
at {0.60, 0.60, 0.60, 0.60, 0.62}) because current main-line changes
since that snapshot (PR sgl-project#318 mem_fraction defaults, sgl-project#319 talker
micro-batching, sgl-project#330 thinker input-length check) altered internal
scheduling determinism. Speed metrics improved in the same window;
accuracy spread widened. The 0.54 floor reflects the worst observed
cold-run accuracy at the time of this commit.

The server fixture is module-scoped and passes both CLI flags so that
the test is pinned to the configuration that produced the calibration,
independent of future factory-default changes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
zhaochenyang20 added a commit to Jayon02/sglang-omni that referenced this pull request Apr 24, 2026
Runs the 50-sample videomme-ci-50 subset at concurrency=4 with the
thinker-only server (--thinker-max-seq-len 32768 --encoder-mem-reserve
0.20) and asserts accuracy, zero failures, and per-concurrency speed
thresholds derived from a 5-run H200 calibration with apply_slack
(0.75/1.25).

Thresholds (vs pre-rebase-snapshot values, both from 5 back-to-back
fresh-server pytest runs on the same H200):

    VIDEOMME_MIN_ACCURACY     0.60  ->  0.54  (widened floor)
    _VIDEOMME_P95.throughput  0.078 ->  0.084 (faster, tightens slack band)
    _VIDEOMME_P95.toks_agg    2.3   ->  2.5   (faster, tightens slack band)
    _VIDEOMME_P95.latency_s   50.3  ->  47.1  (faster, tightens slack band)

The accuracy floor dropped because current main widens the cold-run
accuracy spread from {0.60, 0.60, 0.60, 0.60, 0.62} to
{0.62, 0.54, 0.58, 0.62, 0.58} (correct in {27, 29, 29, 31, 31} / 50,
0 failed every run). Main-line changes that landed between the two
calibration windows (PR sgl-project#318 mem_fraction defaults, sgl-project#319 talker
micro-batching, sgl-project#330 thinker input-length check) altered internal
scheduling determinism; speed metrics improved in the same window
while accuracy spread widened. 0.54 is the worst-observed cold-run
accuracy and is enforced with no slack — any PR that loses a correct
answer below that floor fails the test.
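
The floor follows directly from the per-run correct counts quoted above; a quick check of the arithmetic, using only the numbers in this message:

```python
# Correct answers per run out of 50 samples, as listed above.
correct_per_run = [27, 29, 29, 31, 31]
samples = 50

accuracies = [c / samples for c in correct_per_run]
floor = min(accuracies)

print(accuracies)  # [0.54, 0.58, 0.58, 0.62, 0.62]
print(floor)       # 0.54
```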

The server fixture is module-scoped and passes both CLI flags so that
the test is pinned to the configuration that produced the calibration,
independent of future factory-default changes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
4 participants