[WIP] feat: support flexible inference path for omni model #241

Open

Hangzhi wants to merge 1 commit into sgl-project:main from Hangzhi:feat/flexible-inference-path

Conversation

@Hangzhi Hangzhi commented Mar 31, 2026

Summary

Closes #61.

  • When the speech pipeline is deployed, requests with modalities=["text"] now skip audio stages entirely (talker_ar, code_predictor, code2wav) instead of running them and filtering at the API layer
  • Requests with modalities=["text", "audio"] or no modalities specified behave exactly as before
  • No configuration changes needed — the same pipeline serves both text-only and speech requests dynamically

Changes

  • Modality-aware routing (next_stage.py): New thinker_next_flexible checks request metadata to decide whether to fan out to audio stages
  • Per-request terminal tracking (coordinator.py): Coordinator now tracks expected terminal stages per request so text-only requests resolve after decode without waiting for code2wav
  • Stream suppression (runtime.py): Bootstrap, inter-stage streaming, and stream-done signals are suppressed for audio targets on text-only requests to avoid memory leaks
  • Pipeline config (config.py): Speech pipeline now uses the flexible routing function
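The routing decision described above can be sketched as follows. The `Request` shape and the list-of-stage-names return convention are illustrative assumptions, not the PR's actual implementation; the stage names (`decode`, `talker_ar`, `code_predictor`, `code2wav`) follow the PR description.

```python
from dataclasses import dataclass, field

# Stage names follow the PR description; everything else here is a
# simplified stand-in for illustration.
AUDIO_STAGES = ("talker_ar", "code_predictor", "code2wav")

@dataclass
class Request:
    metadata: dict = field(default_factory=dict)

def thinker_next_flexible(request: Request) -> list:
    """Decide which downstream stages the thinker fans out to."""
    modalities = request.metadata.get("output_modalities")
    if modalities is None or "audio" in modalities:
        # No modalities given, or audio requested: full fan-out as before.
        return ["decode", *AUDIO_STAGES]
    # Text-only request: skip the audio stages entirely.
    return ["decode"]
```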

Test plan

  • Unit tests for thinker_next_flexible with all modality combinations
  • Unit tests for coordinator per-request terminal resolution
  • All 33 existing + new tests pass

Allow per-request control over which output modalities are generated
in the speech pipeline. When a request specifies modalities=["text"],
the audio stages (talker_ar, code_predictor, code2wav) are skipped
entirely instead of running and being filtered at the API layer.
Copilot AI review requested due to automatic review settings March 31, 2026 06:06
@Hangzhi Hangzhi changed the title from "feat: support flexible inference path for omni model" to "[WIP] feat: support flexible inference path for omni model" on Mar 31, 2026

Copilot AI left a comment

Pull request overview

Adds per-request modality-aware routing to the Qwen3-Omni speech pipeline so text-only requests skip the audio stages (and can complete once decode finishes), while mixed text+audio requests keep the existing behavior.

Changes:

  • Introduces thinker_next_flexible routing based on request.metadata["output_modalities"].
  • Adds per-request terminal stage overrides in the coordinator so text-only requests can resolve after decode.
  • Suppresses streaming/bootstrap/stream-done signals toward audio targets for text-only requests to prevent per-request queue/pending-data leaks.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.

Summary per file:

  • sglang_omni/models/qwen3_omni/pipeline/next_stage.py: Adds modality-aware thinker fan-out (thinker_next_flexible).
  • sglang_omni/models/qwen3_omni/config.py: Switches speech pipeline thinker routing to the flexible function.
  • sglang_omni/pipeline/coordinator.py: Tracks per-request expected terminal stages; updates stream/completion resolution logic accordingly.
  • sglang_omni/pipeline/worker/runtime.py: Suppresses streaming/bootstrap and stream-done toward audio targets for text-only requests.
  • tests/test_flexible_inference.py: Adds unit tests for flexible routing and per-request terminal resolution.


finally:
    self._stream_queues.pop(request_id, None)
    self._completion_futures.pop(request_id, None)
    self._request_terminals.pop(request_id, None)

Copilot AI Mar 31, 2026

In stream(), the finally block unconditionally pops self._request_terminals[request_id]. If a client cancels/closes the stream generator early (e.g. disconnect), this removes the per-request terminal override while the request may still be running. For text-only requests in the speech pipeline, subsequent _handle_completion() calls will then fall back to the global terminal set (e.g. {"decode","code2wav"}) and can wait forever for code2wav that will never run, leaking _requests/_partial_results. Fix: don’t clear _request_terminals here (let _handle_completion()/abort() own cleanup), or abort the request before clearing to guarantee completion paths still resolve.

Suggested change
self._request_terminals.pop(request_id, None)
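The alternative the comment mentions, aborting before clearing, might look like this minimal sketch. The Coordinator shape, attribute names, and method signatures here are assumptions for illustration, not the repo's actual code.

```python
class Coordinator:
    """Minimal stand-in illustrating abort-owned cleanup of terminal overrides."""

    def __init__(self):
        self._stream_queues = {}
        self._completion_futures = {}
        self._request_terminals = {}
        self._aborted = set()

    def abort(self, request_id):
        # abort() is the single owner of _request_terminals cleanup, so a
        # completion path never sees the override vanish mid-flight.
        self._aborted.add(request_id)
        self._request_terminals.pop(request_id, None)

    def _stream_finally(self, request_id, client_finished):
        self._stream_queues.pop(request_id, None)
        self._completion_futures.pop(request_id, None)
        if not client_finished:
            # Early disconnect: abort first so the request still resolves
            # instead of waiting on a terminal set that no longer applies.
            self.abort(request_id)
```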

Comment on lines +188 to +198
# Compute per-request terminal stages when the multi-terminal speech
# pipeline is active but this particular request only needs text.
if len(self._terminal_stages) > 1:
    modalities = request.metadata.get("output_modalities")
    if modalities is not None and "audio" not in modalities:
        # Only the text decode terminal is expected.
        text_terminals = {
            s for s in self._terminal_stages if s == "decode"
        }
        if text_terminals:
            self._request_terminals[request_id] = text_terminals

Copilot AI Mar 31, 2026


The per-request terminal override logic is currently hard-coded to the string stage name "decode". Since Coordinator is pipeline-generic, this creates a hidden coupling to one specific pipeline naming scheme and will silently fail to override terminals in any pipeline whose text terminal is named differently. Consider deriving the text terminal from configuration (e.g., a constant exported by the pipeline config, or a mapping from modalities->terminal stage set) rather than embedding the stage name here.
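One shape the suggested config-driven mapping could take is sketched below; the `MODALITY_TERMINALS` constant and `terminals_for` helper are hypothetical names, not existing code.

```python
# Hypothetical modalities -> terminal-stage mapping exported by the pipeline
# config, so the coordinator never hard-codes the "decode" stage name.
MODALITY_TERMINALS = {
    frozenset({"text"}): {"decode"},
    frozenset({"text", "audio"}): {"decode", "code2wav"},
}

def terminals_for(modalities, pipeline_terminals):
    """Resolve the terminal stages a request should wait on."""
    wanted = MODALITY_TERMINALS.get(frozenset(modalities), set(pipeline_terminals))
    # Never expect a terminal the deployed pipeline does not actually have.
    return wanted & set(pipeline_terminals)
```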

Comment on lines +447 to +451
# Skip streaming to targets suppressed for this request (e.g.
# text-only request skipping audio stages in the speech pipeline).
suppressed = self._suppressed_stream_targets.get(request_id)
if suppressed and target_stage in suppressed:
    return

Copilot AI Mar 31, 2026


Stream suppression is implemented for bootstrap and normal chunk enqueues, but error propagation still calls _notify_stream_error() which enqueues error signals to all _stream_targets. For a text-only request where audio targets are suppressed (and never receive a submit/data_ready), these error signals can be buffered in the downstream stage’s _pending_stream_data, recreating the kind of per-request leak this change is trying to avoid. Fix: apply the same suppression check when enqueuing done/error signals (e.g., in _enqueue_stream_done/_enqueue_stream_error and/or inside _notify_stream_error) so suppressed targets never receive any stream control messages.
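Applying the same guard to error/done signals could look like the sketch below. The attribute names mirror the quoted snippet, but the class and method shapes are assumed for illustration.

```python
class WorkerRuntime:
    """Minimal stand-in showing the suppression guard on stream control signals."""

    def __init__(self):
        self._suppressed_stream_targets = {}  # request_id -> set of stage names
        self.delivered = []  # stand-in for the real downstream enqueue

    def _is_suppressed(self, request_id, target_stage):
        suppressed = self._suppressed_stream_targets.get(request_id)
        return bool(suppressed) and target_stage in suppressed

    def _enqueue_stream_error(self, request_id, target_stage, error):
        # Same check as chunk enqueues: a suppressed target never receives
        # any stream control message, so nothing buffers in its pending data.
        if self._is_suppressed(request_id, target_stage):
            return
        self.delivered.append(("error", request_id, target_stage, error))
```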

Comment on lines +6 to +9
import asyncio
import unittest
from unittest.mock import AsyncMock, MagicMock


Copilot AI Mar 31, 2026


The test suite in this repo primarily uses pytest (fixtures, @pytest.mark.asyncio, etc.), but this new test file is written as unittest.TestCase. Pytest will run it, but it’s inconsistent with the existing test conventions and makes async testing/loop management more awkward. Consider rewriting these tests using pytest + pytest-asyncio for consistency with the rest of tests/.

Comment on lines +112 to +114
asyncio.get_event_loop().run_until_complete(
    coord._submit_request("req-text", request)
)

Copilot AI Mar 31, 2026


These tests use asyncio.get_event_loop().run_until_complete(...). On newer Python versions this pattern is deprecated and can be flaky depending on event-loop policy; the rest of the repo generally uses asyncio.run(...) or @pytest.mark.asyncio async tests. Switching to asyncio.run(coord._submit_request(...)) (or making the test async under pytest) would avoid deprecation warnings and make loop ownership explicit.
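The suggested switch to asyncio.run might look like this; FakeCoordinator and the request dict are stand-ins here, not the repo's real types.

```python
import asyncio

class FakeCoordinator:
    """Stand-in for the real coordinator; only illustrates loop ownership."""

    def __init__(self):
        self.submitted = {}

    async def _submit_request(self, request_id, request):
        self.submitted[request_id] = request

coord = FakeCoordinator()
request = {"modalities": ["text"]}

# asyncio.run creates and closes its own event loop, avoiding the deprecated
# get_event_loop() pattern and making loop ownership explicit.
asyncio.run(coord._submit_request("req-text", request))
```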


Development

Successfully merging this pull request may close these issues.

[Feature] Support flexible inference path for omni model

2 participants