Qwen3 Omni speech pipeline crashes with concurrent requests (CUDA illegal memory access) #229

@zhaochenyang20

Problem

This is possibly related to #219. The Qwen3 Omni speech pipeline (run_qwen3_omni_speech_server.py) cannot handle concurrent requests from multiple clients. Sending even 2 simultaneous requests to the same server causes an unrecoverable CUDA illegal memory access, after which all subsequent requests fail with HTTP 500.

Reproduction

Start the server:

CUDA_VISIBLE_DEVICES=1,2,7 python examples/run_qwen3_omni_speech_server.py \
    --model-path Qwen/Qwen3-Omni-30B-A3B-Instruct \
    --gpu-thinker 0 --gpu-talker 1 --gpu-code-predictor 2 --gpu-code2wav 0 \
    --port 8000 --model-name qwen3-omni

Send 2 requests simultaneously (e.g., two eval scripts pointing at port 8000, or a single script with --concurrency 2). Both will fail.
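
A minimal way to trigger this without the eval scripts is to fire two requests at the server in parallel. The sketch below is hypothetical: the endpoint path and JSON body are placeholders (the server's API is not shown in this issue), so substitute whatever request your eval script actually sends. The only thing that matters is that both requests are in flight at the same time.

# Hypothetical reproduction sketch -- adjust ENDPOINT and BODY to the real API.
ENDPOINT=http://localhost:8000/v1/chat/completions   # placeholder path
BODY='{"model": "qwen3-omni", "messages": [{"role": "user", "content": "hello"}]}'   # placeholder payload

# Launch both requests in the background so they reach the server concurrently.
curl -s -X POST "$ENDPOINT" -H 'Content-Type: application/json' -d "$BODY" &
curl -s -X POST "$ENDPOINT" -H 'Content-Type: application/json' -d "$BODY" &
wait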

Error

HTTP 500: {"detail":"CUDA error: an illegal memory access was encountered\n
Search for cudaErrorIllegalAddress in
https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html
for more information.\n
CUDA kernel errors might be asynchronously reported at some other API call,
so the stacktrace below might be incorrect.\n
For debugging consider passing CUDA_LAUNCH_BLOCKING=1\n
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.\n"}

After this error, the CUDA context remains corrupted: all subsequent requests fail with the same CUDA error, and the server must be restarted.
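
The error text itself points at the standard CUDA debugging step: rerun with CUDA_LAUNCH_BLOCKING=1 so kernel launches are synchronous and the reported stack trace points at the actual failing kernel. That just means prefixing the same launch command, for example:

# Same command as above, with synchronous kernel launches for accurate stack traces.
CUDA_LAUNCH_BLOCKING=1 CUDA_VISIBLE_DEVICES=1,2,7 python examples/run_qwen3_omni_speech_server.py \
    --model-path Qwen/Qwen3-Omni-30B-A3B-Instruct \
    --gpu-thinker 0 --gpu-talker 1 --gpu-code-predictor 2 --gpu-code2wav 0 \
    --port 8000 --model-name qwen3-omni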

Impact

This makes it impossible to:

  1. Run WER benchmarks with any request concurrency (even concurrency=2)
  2. Run two different evaluation scenarios (e.g., no-VC and with-VC) against the same server simultaneously
  3. Serve multiple users from a single Qwen3 Omni instance

Current Workaround

For benchmarking, we start two separate Qwen3 servers on different GPUs/ports/IPC paths (one per evaluation scenario). This wastes 6 GPUs instead of 3 and does not scale.
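
For reference, the workaround is just two independent launches of the same script on disjoint GPU sets and ports, one per evaluation scenario. The sketch below reuses the flags from the command above; the second GPU set and port are hypothetical, and the IPC paths must also be kept separate through whatever mechanism the script provides (not shown here).

# Server A (e.g. no-VC scenario)
CUDA_VISIBLE_DEVICES=1,2,7 python examples/run_qwen3_omni_speech_server.py \
    --model-path Qwen/Qwen3-Omni-30B-A3B-Instruct \
    --gpu-thinker 0 --gpu-talker 1 --gpu-code-predictor 2 --gpu-code2wav 0 \
    --port 8000 --model-name qwen3-omni &

# Server B (e.g. with-VC scenario) -- hypothetical second GPU set and port
CUDA_VISIBLE_DEVICES=3,4,5 python examples/run_qwen3_omni_speech_server.py \
    --model-path Qwen/Qwen3-Omni-30B-A3B-Instruct \
    --gpu-thinker 0 --gpu-talker 1 --gpu-code-predictor 2 --gpu-code2wav 0 \
    --port 8001 --model-name qwen3-omni &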

Context

Discovered while testing PR #223 (benchmark-redesign). The existing fixes in that PR (disabling the radix cache for the talker in stages.py, and the tensor clone in sglang_ar.py) address sequential-request stability but do not fix concurrent-request handling.

Environment

  • Model: Qwen/Qwen3-Omni-30B-A3B-Instruct
  • Server script: examples/run_qwen3_omni_speech_server.py
  • Pipeline: 9-stage multi-process speech pipeline via MultiProcessPipelineRunner
  • Hardware: NVIDIA H200 (144 GB)
