[Qwen3-Omni] Fix concurrent request CUDA crash in speech pipeline #240

Open

ischencheng wants to merge 6 commits into sgl-project:main
Conversation
Commits:

…CUDA illegal memory access
…ze mismatch

When multiple concurrent requests are submitted to the talker_ar sglang engine simultaneously, its KV cache operations fail with a batch_size mismatch (e.g. expected 92 but got 76), which then corrupts GPU state and triggers CUDA illegal memory access errors. Add an asyncio.Lock (engine_lock) that is acquired just before submitting a request to the sglang engine and released only after the engine finishes generating all tokens for that request (_await_result completes). This ensures requests are fed into the engine one at a time while still allowing concurrent thinker-chunk collection and downstream code_predictor/code2wav processing to proceed in parallel.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
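For illustration, a minimal sketch of the locking pattern this commit describes, assuming a hypothetical wrapper class around the sglang engine; the submit/stream API shown here is invented for the example, and a later commit removes this workaround once the root cause is fixed:

```python
import asyncio

class TalkerAREngineWrapper:
    """Hypothetical wrapper; only the engine_lock / _await_result pattern is from the commit."""

    def __init__(self, engine):
        self.engine = engine
        self.engine_lock = asyncio.Lock()  # serializes engine submissions

    async def generate(self, request):
        # Acquired just before submitting; released only after the engine has
        # generated all tokens for this request (_await_result completes).
        async with self.engine_lock:
            stream = self.engine.submit(request)   # invented engine API
            return await self._await_result(stream)

    async def _await_result(self, stream):
        tokens = []
        async for token in stream:                 # invented streaming API
            tokens.append(token)
        return tokens
```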
…t CUDA crash

When consecutive requests share the same prompt, sglang's radix tree cache matches a token prefix and reduces extend_input_len accordingly. However, input_embeds was not sliced to match, causing a size mismatch in the TVM store_cache kernel (e.g. "expected 23 but got 16") which corrupted GPU state and triggered CUDA illegal memory access for all subsequent requests. Fix by slicing input_embeds by prefix_indices in _rebuild_prefill_input_embeds, and by preferring the correctly-sliced version over sglang core's unsliced forward_batch.input_embeds. Also removes the talker engine_lock workaround (commit 012c638), which is no longer needed now that the root cause is fixed — concurrent requests can again run in parallel through the talker engine.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
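The fix reduces to dropping the embedding rows the radix cache already covers. A minimal standalone sketch under that assumption (the function below is a simplification, not the real _rebuild_prefill_input_embeds signature in sglang_ar.py):

```python
import torch

def slice_prefill_input_embeds(input_embeds: torch.Tensor,
                               prefix_indices: torch.Tensor) -> torch.Tensor:
    """Keep only the embedding rows for tokens that still need prefill.

    input_embeds:   [num_prompt_tokens, hidden] rows for the full prompt.
    prefix_indices: KV-cache indices of the tokens matched by the radix tree;
                    its length equals the cached prefix length.
    The returned row count then matches extend_input_len.
    """
    prefix_len = prefix_indices.numel()
    return input_embeds[prefix_len:]

# Mirroring the commit's numbers: 23 prompt tokens with 7 cached leaves
# 16 rows, matching the reduced extend_input_len.
embeds = torch.randn(23, 4096)
assert slice_prefill_input_embeds(embeds, torch.arange(7)).shape[0] == 16
```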
Add tests verifying that _rebuild_prefill_input_embeds correctly slices input_embeds by prefix_indices to match extend_input_len:

- With a tree-cache prefix match (23 tokens, 7 cached → 16 returned)
- Without a prefix match (all rows returned)
- Multiple requests with different prefix lengths

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
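Roughly, tests along these lines might look as follows, exercising the simplified slicing helper from the sketch above rather than the real sglang_ar.py internals:

```python
import torch

def test_slices_with_prefix_match():
    # Tree-cache prefix match: 23 tokens, 7 cached -> 16 rows returned.
    embeds = torch.randn(23, 8)
    out = slice_prefill_input_embeds(embeds, torch.arange(7))
    assert out.shape == (16, 8)
    assert torch.equal(out, embeds[7:])  # surviving rows are the prompt suffix

def test_all_rows_without_prefix_match():
    # No cached prefix: every embedding row is returned.
    embeds = torch.randn(23, 8)
    out = slice_prefill_input_embeds(embeds, torch.empty(0, dtype=torch.long))
    assert out.shape == (23, 8)

def test_different_prefix_lengths_per_request():
    # Each request may hit a different cached-prefix length.
    embeds = torch.randn(10, 8)
    for cached in (0, 3, 9):
        out = slice_prefill_input_embeds(embeds, torch.arange(cached))
        assert out.shape[0] == 10 - cached
```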
…ring

Rename from test_issue_229_concurrent_lock.py to better reflect the scope of the tests (GPU lock serialization + prefill input_embeds prefix slicing).

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Motivation
Closes #229. The Qwen3-Omni speech pipeline crashes with CUDA illegal memory access when handling concurrent TTS requests. Even 2 simultaneous requests corrupt GPU state, causing all subsequent requests to fail with HTTP 500 until server restart.
Two independent bugs contribute to this:
- GPU inference in the code_predictor and code2wav executors is dispatched via run_in_executor on shared model instances without synchronization, allowing concurrent CUDA operations within the same process (see the sketch after this list).
- When consecutive requests share the same prompt, sglang's radix tree cache matches a token prefix and reduces extend_input_len (e.g. 23 → 16). However, input_embeds is not sliced accordingly, causing a shape mismatch in the TVM store_cache kernel ("expected 23 but got 16") that corrupts GPU state.
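As a sketch of the first fix (the class and attribute names below are illustrative; only the asyncio.Lock-around-run_in_executor pattern is taken from the PR):

```python
import asyncio

class CodePredictorExecutor:
    """Illustrative stand-in for code_predictor_executor.py; names are hypothetical."""

    def __init__(self, model):
        self.model = model
        # One lock per shared model: CUDA work is serialized within the process.
        self._gpu_lock = asyncio.Lock()

    async def predict(self, features):
        loop = asyncio.get_running_loop()
        # Without the lock, two coroutines could run model.forward concurrently
        # in the thread pool, interleaving CUDA ops on the same model.
        async with self._gpu_lock:
            return await loop.run_in_executor(None, self.model.forward, features)
```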
Modifications

- code_predictor_executor.py: Add asyncio.Lock to serialize GPU inference calls, preventing concurrent CUDA operations on the shared model.
- code2wav_executor.py: Same asyncio.Lock serialization for the codec decoder.
- sglang_ar.py: Slice input_embeds by prefix_indices in _rebuild_prefill_input_embeds so the embed count matches extend_input_len. Prefer the correctly-sliced version over sglang core's unsliced forward_batch.input_embeds for projected prefill.
- tests/test_concurrent_tts.py: Add regression tests covering both fixes (CPU-only mocks, no GPU required).

Related Issues

- Closes #229
Testing Notes
pytest tests/test_concurrent_tts.py — 5 tests pass