
[Qwen3-Omni] Fix concurrent request CUDA crash in speech pipeline #240

Open
ischencheng wants to merge 6 commits into sgl-project:main from ischencheng:fix/issue-229-concurrent-lock

Conversation

@ischencheng
Contributor

Motivation

Closes #229. The Qwen3-Omni speech pipeline crashes with a CUDA illegal memory access when handling concurrent TTS requests. Even two simultaneous requests corrupt GPU state, causing all subsequent requests to fail with HTTP 500 until the server is restarted.

Two independent bugs contribute to this:

  1. GPU inference race in code_predictor / code2wav: Both executors call run_in_executor on shared model instances without synchronization, allowing concurrent CUDA operations within the same process.
  2. Tree-cache prefix vs input_embeds mismatch in talker prefill: When consecutive requests share the same prompt tokens, sglang's radix tree cache matches a prefix and reduces extend_input_len (e.g. 23 → 16). However, input_embeds is not sliced accordingly, causing a shape mismatch in the TVM store_cache kernel ("expected 23 but got 16") that corrupts GPU state. See the sketch after this list.
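
To make bug 2 concrete, here is a minimal sketch of the shape arithmetic; the tensor names and sizes are illustrative stand-ins, not the actual sglang internals:

```python
import torch

# Hypothetical sizes; the real shapes come from the talker prefill batch.
hidden = 4096
prompt_len = 23
input_embeds = torch.randn(prompt_len, hidden)

# The radix tree cache matches the first 7 prompt tokens against an
# earlier request, so only the remaining tokens are actually prefilled.
prefix_len = 7
extend_input_len = prompt_len - prefix_len   # 16

# Bug: all 23 embed rows were passed while the kernel sized its work by
# extend_input_len, producing the "expected 23 but got 16" mismatch.
# Fix: drop the cached prefix rows so the counts agree.
extend_embeds = input_embeds[prefix_len:]
assert extend_embeds.shape[0] == extend_input_len
```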

Modifications

  • code_predictor_executor.py: Add an asyncio.Lock to serialize GPU inference calls, preventing concurrent CUDA operations on the shared model (see the sketch after this list).
  • code2wav_executor.py: Same asyncio.Lock serialization for the codec decoder.
  • sglang_ar.py: Slice input_embeds by prefix_indices in _rebuild_prefill_input_embeds so the embed count matches extend_input_len. Prefer the correctly-sliced version over sglang core's unsliced forward_batch.input_embeds for projected prefill.
  • tests/test_concurrent_tts.py: Add regression tests covering both fixes (CPU-only mocks, no GPU required).
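
Both executor changes follow the same pattern. A minimal sketch, assuming hypothetical names (model, infer, predict) rather than the executors' real interfaces:

```python
import asyncio

class CodePredictorExecutor:
    """Sketch: one asyncio.Lock guards all GPU work on the shared model."""

    def __init__(self, model):
        self.model = model
        self._gpu_lock = asyncio.Lock()

    async def predict(self, batch):
        loop = asyncio.get_running_loop()
        # Only one coroutine at a time may dispatch CUDA work on the
        # shared model; concurrent callers queue here instead of racing.
        async with self._gpu_lock:
            return await loop.run_in_executor(None, self.model.infer, batch)
```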

Related Issues

  • #229: Qwen3 Omni speech pipeline crashes with concurrent requests (CUDA illegal memory access)

Testing Notes

  • Unit tests: pytest tests/test_concurrent_tts.py — 5 tests pass
    • code_predictor GPU lock serialization
    • code2wav GPU lock serialization
    • prefill input_embeds prefix slicing (with/without cache hit, multi-request)
  • Integration: 4 concurrent TTS requests on live server — all return 200 with valid audio
  • Cache correctness: 3 sequential identical requests verified — cache-hit requests produce audio with same RMS quality as first (no-cache) request
  • Accuracy: No model-side logic changes; projected embeddings are deterministic for identical prompts, so tree-cache prefix reuse is semantically correct

ischencheng and others added 6 commits March 30, 2026 16:35
…ze mismatch

When multiple concurrent requests are submitted to the talker_ar sglang
engine simultaneously, its KV cache operations fail with a batch_size
mismatch (e.g., expected 92 but got 76), which then corrupts GPU state
and triggers CUDA illegal memory access errors.

Add an asyncio.Lock (engine_lock) that is acquired just before submitting
a request to the sglang engine and released only after the engine finishes
generating all tokens for that request (_await_result completes). This
ensures requests are fed into the engine one at a time while still allowing
concurrent thinker-chunk collection and downstream code_predictor/code2wav
processing to proceed in parallel.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
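
A minimal sketch of the locking pattern this commit describes; engine, submit_request, and _await_result are hypothetical stand-ins for the talker_ar plumbing, and the follow-up commit below removes this lock once the root cause is fixed:

```python
import asyncio

class TalkerEngineWrapper:
    def __init__(self, engine):
        self.engine = engine
        self.engine_lock = asyncio.Lock()

    async def generate(self, request):
        # Held from submission until all tokens for this request are
        # generated, so requests enter the sglang engine one at a time.
        # Thinker-chunk collection and code_predictor/code2wav work for
        # other requests can still proceed concurrently outside this block.
        async with self.engine_lock:
            handle = await self.engine.submit_request(request)
            return await self._await_result(handle)

    async def _await_result(self, handle):
        ...  # collect generated tokens until the engine finishes
```
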
…t CUDA crash

When consecutive requests share the same prompt, sglang's radix tree cache
matches a token prefix and reduces extend_input_len accordingly. However,
input_embeds was not sliced to match, causing a size mismatch in the TVM
store_cache kernel (e.g. "expected 23 but got 16") which corrupted GPU state
and triggered CUDA illegal memory access for all subsequent requests.

Fix by slicing input_embeds by prefix_indices in _rebuild_prefill_input_embeds,
and preferring the correctly-sliced version over sglang core's unsliced
forward_batch.input_embeds.

Also removes the talker engine_lock workaround (commit 012c638) which is no
longer needed now that the root cause is fixed — concurrent requests can
again run in parallel through the talker engine.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Add tests verifying that _rebuild_prefill_input_embeds correctly slices
input_embeds by prefix_indices to match extend_input_len:
- With tree cache prefix match (23 tokens, 7 cached → 16 returned)
- Without prefix match (all rows returned)
- Multiple requests with different prefix lengths

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
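
A minimal sketch of what such a slicing test can look like; rebuild_prefill_input_embeds here is a hypothetical stand-in for the real _rebuild_prefill_input_embeds, which operates on sglang batch objects rather than bare tensors:

```python
import torch

def rebuild_prefill_input_embeds(input_embeds, prefix_len):
    # Stand-in for the slicing done in _rebuild_prefill_input_embeds:
    # drop the rows covered by the cached prefix.
    return input_embeds[prefix_len:]

def test_prefix_slice_matches_extend_len():
    embeds = torch.randn(23, 8)
    # With tree cache prefix match: 23 tokens, 7 cached -> 16 returned.
    assert rebuild_prefill_input_embeds(embeds, prefix_len=7).shape[0] == 16

def test_no_prefix_returns_all_rows():
    embeds = torch.randn(23, 8)
    # Without a prefix match, all rows are returned.
    assert rebuild_prefill_input_embeds(embeds, prefix_len=0).shape[0] == 23
```
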
…ring

Rename from test_issue_229_concurrent_lock.py to better reflect the scope
of tests (GPU lock serialization + prefill input_embeds prefix slicing).

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>