
[Qwen3-Omni] Fix concurrent request CUDA crash in speech pipeline #240

Open
ischencheng wants to merge 6 commits into sgl-project:main from ischencheng:fix/issue-229-concurrent-lock

Conversation

@ischencheng
Contributor

Motivation

Closes #229. The Qwen3-Omni speech pipeline crashes with a CUDA illegal memory access when handling concurrent TTS requests. Even two simultaneous requests corrupt GPU state, causing all subsequent requests to fail with HTTP 500 until the server is restarted.

Two independent bugs contribute to this:

  1. GPU inference race in code_predictor / code2wav: Both executors call run_in_executor on shared model instances without synchronization, allowing concurrent CUDA operations within the same process.
  2. Tree-cache prefix vs input_embeds mismatch in talker prefill: When consecutive requests share the same prompt tokens, sglang's radix tree cache matches a prefix and reduces extend_input_len (e.g. 23 → 16). However, input_embeds is not sliced accordingly, causing a shape mismatch in the TVM store_cache kernel ("expected 23 but got 16") that corrupts GPU state. See the sketch after this list.
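
To make bug 2 concrete, here is a minimal sketch of the shape arithmetic; the tensor names and sizes are illustrative stand-ins, not the actual sglang internals:

```python
import torch

# Hypothetical sizes; the real shapes come from the talker prefill batch.
hidden = 4096
prompt_len = 23
input_embeds = torch.randn(prompt_len, hidden)

# The radix tree cache matches the first 7 prompt tokens against an
# earlier request, so only the remaining tokens are actually prefilled.
prefix_len = 7
extend_input_len = prompt_len - prefix_len   # 16

# Bug: all 23 embed rows were passed while the kernel sized its work by
# extend_input_len, producing the "expected 23 but got 16" mismatch.
# Fix: drop the cached prefix rows so the counts agree.
extend_embeds = input_embeds[prefix_len:]
assert extend_embeds.shape[0] == extend_input_len
```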

Modifications

  • code_predictor_executor.py: Add an asyncio.Lock to serialize GPU inference calls, preventing concurrent CUDA operations on the shared model (see the sketch after this list).
  • code2wav_executor.py: Same asyncio.Lock serialization for the codec decoder.
  • sglang_ar.py: Slice input_embeds by prefix_indices in _rebuild_prefill_input_embeds so the embed count matches extend_input_len. Prefer the correctly-sliced version over sglang core's unsliced forward_batch.input_embeds for projected prefill.
  • tests/test_concurrent_tts.py: Add regression tests covering both fixes (CPU-only mocks, no GPU required).
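
Both executor changes follow the same pattern. A minimal sketch, assuming hypothetical names (model, infer, predict) rather than the executors' real interfaces:

```python
import asyncio

class CodePredictorExecutor:
    """Sketch: one asyncio.Lock guards all GPU work on the shared model."""

    def __init__(self, model):
        self.model = model
        self._gpu_lock = asyncio.Lock()

    async def predict(self, batch):
        loop = asyncio.get_running_loop()
        # Only one coroutine at a time may dispatch CUDA work on the
        # shared model; concurrent callers queue here instead of racing.
        async with self._gpu_lock:
            return await loop.run_in_executor(None, self.model.infer, batch)
```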

Related Issues

  • #229: Qwen3 Omni speech pipeline crashes with concurrent requests (CUDA illegal memory access)

Testing Notes

  • Unit tests: pytest tests/test_concurrent_tts.py — 5 tests pass
    • code_predictor GPU lock serialization
    • code2wav GPU lock serialization
    • prefill input_embeds prefix slicing (with/without cache hit, multi-request)
  • Integration: 4 concurrent TTS requests on live server — all return 200 with valid audio
  • Cache correctness: 3 sequential identical requests verified — cache-hit requests produce audio with same RMS quality as first (no-cache) request
  • Accuracy: No model-side logic changes; projected embeddings are deterministic for identical prompts, so tree-cache prefix reuse is semantically correct

ischencheng and others added 6 commits March 30, 2026 16:35
…ze mismatch

When multiple concurrent requests are submitted to the talker_ar sglang
engine simultaneously, its KV cache operations fail with a batch_size
mismatch (e.g., expected 92 but got 76), which then corrupts GPU state
and triggers CUDA illegal memory access errors.

Add an asyncio.Lock (engine_lock) that is acquired just before submitting
a request to the sglang engine and released only after the engine finishes
generating all tokens for that request (_await_result completes). This
ensures requests are fed into the engine one at a time while still allowing
concurrent thinker-chunk collection and downstream code_predictor/code2wav
processing to proceed in parallel.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
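
A minimal sketch of the locking pattern this commit describes; engine, submit_request, and _await_result are hypothetical stand-ins for the talker_ar plumbing, and the follow-up commit below removes this lock once the root cause is fixed:

```python
import asyncio

class TalkerEngineWrapper:
    def __init__(self, engine):
        self.engine = engine
        self.engine_lock = asyncio.Lock()

    async def generate(self, request):
        # Held from submission until all tokens for this request are
        # generated, so requests enter the sglang engine one at a time.
        # Thinker-chunk collection and code_predictor/code2wav work for
        # other requests can still proceed concurrently outside this block.
        async with self.engine_lock:
            handle = await self.engine.submit_request(request)
            return await self._await_result(handle)

    async def _await_result(self, handle):
        ...  # collect generated tokens until the engine finishes
```
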
…t CUDA crash

When consecutive requests share the same prompt, sglang's radix tree cache
matches a token prefix and reduces extend_input_len accordingly. However,
input_embeds was not sliced to match, causing a size mismatch in the TVM
store_cache kernel (e.g. "expected 23 but got 16") which corrupted GPU state
and triggered CUDA illegal memory access for all subsequent requests.

Fix by slicing input_embeds by prefix_indices in _rebuild_prefill_input_embeds,
and preferring the correctly-sliced version over sglang core's unsliced
forward_batch.input_embeds.

Also removes the talker engine_lock workaround (commit 012c638) which is no
longer needed now that the root cause is fixed — concurrent requests can
again run in parallel through the talker engine.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Add tests verifying that _rebuild_prefill_input_embeds correctly slices
input_embeds by prefix_indices to match extend_input_len:
- With tree cache prefix match (23 tokens, 7 cached → 16 returned)
- Without prefix match (all rows returned)
- Multiple requests with different prefix lengths

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
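
A minimal sketch of what such a slicing test can look like; rebuild_prefill_input_embeds here is a hypothetical stand-in for the real _rebuild_prefill_input_embeds, which operates on sglang batch objects rather than bare tensors:

```python
import torch

def rebuild_prefill_input_embeds(input_embeds, prefix_len):
    # Stand-in for the slicing done in _rebuild_prefill_input_embeds:
    # drop the rows covered by the cached prefix.
    return input_embeds[prefix_len:]

def test_prefix_slice_matches_extend_len():
    embeds = torch.randn(23, 8)
    # With tree cache prefix match: 23 tokens, 7 cached -> 16 returned.
    assert rebuild_prefill_input_embeds(embeds, prefix_len=7).shape[0] == 16

def test_no_prefix_returns_all_rows():
    embeds = torch.randn(23, 8)
    # Without a prefix match, all rows are returned.
    assert rebuild_prefill_input_embeds(embeds, prefix_len=0).shape[0] == 23
```
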
…ring

Rename from test_issue_229_concurrent_lock.py to better reflect the scope
of tests (GPU lock serialization + prefill input_embeds prefix slicing).

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>