Problem
This may be related to #219. The Qwen3 Omni speech pipeline (run_qwen3_omni_speech_server.py) cannot handle concurrent requests from multiple clients. Sending even 2 simultaneous requests to the same server triggers an unrecoverable CUDA illegal memory access, after which all subsequent requests fail with HTTP 500.
Reproduction
Start the server:
CUDA_VISIBLE_DEVICES=1,2,7 python examples/run_qwen3_omni_speech_server.py \
--model-path Qwen/Qwen3-Omni-30B-A3B-Instruct \
--gpu-thinker 0 --gpu-talker 1 --gpu-code-predictor 2 --gpu-code2wav 0 \
--port 8000 --model-name qwen3-omni
Send 2 requests simultaneously (e.g., two eval scripts pointing at port 8000, or a single script with --concurrency 2). Both will fail.
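For example, a minimal concurrent-client sketch. This assumes the server exposes an OpenAI-compatible chat endpoint at /v1/chat/completions; the endpoint path and JSON payload are placeholders, so substitute whatever request your eval script actually sends:

# Fire two requests at the server concurrently; after the crash,
# both (and all later requests) come back as HTTP 500.
for i in 1 2; do
  curl -s http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "qwen3-omni", "messages": [{"role": "user", "content": "test"}]}' &
done
wait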
Error
HTTP 500: {"detail":"CUDA error: an illegal memory access was encountered\n
Search for cudaErrorIllegalAddress in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.\n
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.\n
For debugging consider passing CUDA_LAUNCH_BLOCKING=1\n
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.\n"}
After this error, the GPU state is permanently corrupted — all subsequent requests also fail with the same CUDA error. The server must be restarted.
Impact
This makes it impossible to:
- Run WER benchmarks with any request concurrency (even concurrency=2)
- Run two different evaluation scenarios (e.g., no-VC and with-VC) against the same server simultaneously
- Serve multiple users from a single Qwen3 Omni instance
Current Workaround
For benchmarking, we start two separate Qwen3 servers on different GPUs/ports/IPC paths (one per evaluation scenario). This wastes 6 GPUs instead of 3 and does not scale.
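For reference, the workaround looks roughly like this. The GPU indices and second port are illustrative; each instance must also be given a distinct IPC path, but the flag for that is not shown in the command above, so it is omitted here:

# Instance 1 (e.g., no-VC scenario) on GPUs 1,2,7, port 8000
CUDA_VISIBLE_DEVICES=1,2,7 python examples/run_qwen3_omni_speech_server.py \
  --model-path Qwen/Qwen3-Omni-30B-A3B-Instruct \
  --gpu-thinker 0 --gpu-talker 1 --gpu-code-predictor 2 --gpu-code2wav 0 \
  --port 8000 --model-name qwen3-omni

# Instance 2 (e.g., with-VC scenario) on three other GPUs, port 8001
CUDA_VISIBLE_DEVICES=3,4,5 python examples/run_qwen3_omni_speech_server.py \
  --model-path Qwen/Qwen3-Omni-30B-A3B-Instruct \
  --gpu-thinker 0 --gpu-talker 1 --gpu-code-predictor 2 --gpu-code2wav 0 \
  --port 8001 --model-name qwen3-omni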
Context
Discovered during PR #223 (benchmark-redesign) testing. The existing fixes in that PR (disabling the radix cache for the talker in stages.py, the tensor clone in sglang_ar.py) address sequential-request stability but do not fix concurrent-request handling.
Environment
- Model: Qwen/Qwen3-Omni-30B-A3B-Instruct
- Server script: examples/run_qwen3_omni_speech_server.py
- Pipeline: 9-stage multi-process speech pipeline via MultiProcessPipelineRunner
- Hardware: NVIDIA H200 (144 GB)