Qwen3 Omni speech pipeline crashes with concurrent requests (CUDA illegal memory access) #229

@zhaochenyang20

Problem

This is possibly related to #219. The Qwen3 Omni speech pipeline (run_qwen3_omni_speech_server.py) cannot handle concurrent requests from multiple clients. Sending even 2 simultaneous requests to the same server causes an unrecoverable CUDA illegal memory access, after which all subsequent requests fail with HTTP 500.

Reproduction

Start the server:

CUDA_VISIBLE_DEVICES=1,2,7 python examples/run_qwen3_omni_speech_server.py \
    --model-path Qwen/Qwen3-Omni-30B-A3B-Instruct \
    --gpu-thinker 0 --gpu-talker 1 --gpu-code-predictor 2 --gpu-code2wav 0 \
    --port 8000 --model-name qwen3-omni

Send 2 requests simultaneously (e.g., two eval scripts pointing at port 8000, or a single script with --concurrency 2). Both will fail.
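
A minimal way to trigger this without the eval scripts is to fire two requests at the server in parallel. The sketch below is hypothetical: the endpoint path and JSON body are placeholders (the server's API is not shown in this issue), so substitute whatever request your eval script actually sends. The only thing that matters is that both requests are in flight at the same time.

# Hypothetical reproduction sketch -- adjust ENDPOINT and BODY to the real API.
ENDPOINT=http://localhost:8000/v1/chat/completions   # placeholder path
BODY='{"model": "qwen3-omni", "messages": [{"role": "user", "content": "hello"}]}'   # placeholder payload

# Launch both requests in the background so they reach the server concurrently.
curl -s -X POST "$ENDPOINT" -H 'Content-Type: application/json' -d "$BODY" &
curl -s -X POST "$ENDPOINT" -H 'Content-Type: application/json' -d "$BODY" &
wait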

Error

HTTP 500: {"detail":"CUDA error: an illegal memory access was encountered\n
Search for cudaErrorIllegalAddress in
https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html
for more information.\n
CUDA kernel errors might be asynchronously reported at some other API call,
so the stacktrace below might be incorrect.\n
For debugging consider passing CUDA_LAUNCH_BLOCKING=1\n
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.\n"}

After this error, the CUDA context remains corrupted: all subsequent requests fail with the same CUDA error, and the server must be restarted.
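
The error text itself points at the standard CUDA debugging step: rerun with CUDA_LAUNCH_BLOCKING=1 so kernel launches are synchronous and the reported stack trace points at the actual failing kernel. That just means prefixing the same launch command, for example:

# Same command as above, with synchronous kernel launches for accurate stack traces.
CUDA_LAUNCH_BLOCKING=1 CUDA_VISIBLE_DEVICES=1,2,7 python examples/run_qwen3_omni_speech_server.py \
    --model-path Qwen/Qwen3-Omni-30B-A3B-Instruct \
    --gpu-thinker 0 --gpu-talker 1 --gpu-code-predictor 2 --gpu-code2wav 0 \
    --port 8000 --model-name qwen3-omni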

Impact

This makes it impossible to:

  1. Run WER benchmarks with any request concurrency (even concurrency=2)
  2. Run two different evaluation scenarios (e.g., no-VC and with-VC) against the same server simultaneously
  3. Serve multiple users from a single Qwen3 Omni instance

Current Workaround

For benchmarking, we start two separate Qwen3 servers on different GPUs/ports/IPC paths (one per evaluation scenario). This wastes 6 GPUs instead of 3 and does not scale.
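
For reference, the workaround is just two independent launches of the same script on disjoint GPU sets and ports, one per evaluation scenario. The sketch below reuses the flags from the command above; the second GPU set and port are hypothetical, and the IPC paths must also be kept separate through whatever mechanism the script provides (not shown here).

# Server A (e.g. no-VC scenario)
CUDA_VISIBLE_DEVICES=1,2,7 python examples/run_qwen3_omni_speech_server.py \
    --model-path Qwen/Qwen3-Omni-30B-A3B-Instruct \
    --gpu-thinker 0 --gpu-talker 1 --gpu-code-predictor 2 --gpu-code2wav 0 \
    --port 8000 --model-name qwen3-omni &

# Server B (e.g. with-VC scenario) -- hypothetical second GPU set and port
CUDA_VISIBLE_DEVICES=3,4,5 python examples/run_qwen3_omni_speech_server.py \
    --model-path Qwen/Qwen3-Omni-30B-A3B-Instruct \
    --gpu-thinker 0 --gpu-talker 1 --gpu-code-predictor 2 --gpu-code2wav 0 \
    --port 8001 --model-name qwen3-omni &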

Context

Discovered while testing PR #223 (benchmark-redesign). The existing fixes in that PR (disabling the radix cache for the talker in stages.py, and the tensor clone in sglang_ar.py) address sequential-request stability but do not fix concurrent-request handling.

Environment

  • Model: Qwen/Qwen3-Omni-30B-A3B-Instruct
  • Server script: examples/run_qwen3_omni_speech_server.py
  • Pipeline: 9-stage multi-process speech pipeline via MultiProcessPipelineRunner
  • Hardware: NVIDIA H200 (144 GB)
