是否已有关于该错误的issue或讨论? | Is there an existing issue / discussion for this?
该问题是否在FAQ中有解答? | Is there an existing answer for this in FAQ?
当前行为 | Current Behavior
Summary
We are seeing effectively serialized query handling with Qwen 3.5 models in Ollama, even on high-memory Apple Silicon systems where single-stream performance is very strong. The result is that Qwen delivers excellent per-request speed, but poor true multi-user concurrency in a single server process.
Environment
- Model family: Qwen 3.5
- Models tested: 35B and 122B variants
- Runtime: Ollama
- Backend: MLX / Apple Silicon
- Deployment type: private multi-user serving environment
- Front end: Open WebUI
Observed behavior
- Individual query speed is excellent.
- Under multiple simultaneous user requests, generation appears to serialize within a single Ollama server instance.
- This creates a "pseudo-concurrency" effect: the system still feels responsive because Qwen is fast, but requests are not actually served in parallel, as efficient multi-user production use requires.
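The serialization can be made measurable with a small client-side probe. Below is a minimal sketch that fires several identical requests at once and reports how much their service windows overlap; it assumes the default Ollama endpoint on `localhost:11434`, and the model tag is a placeholder to substitute with whatever you actually pulled:

```python
"""Probe whether an Ollama server overlaps request processing.

Fires n identical generation requests at once and records each
request's (start, end) wall-clock window. If the server truly
serializes, the windows barely overlap.
"""
import json
import threading
import time
import urllib.request

ENDPOINT = "http://localhost:11434/api/generate"  # default Ollama port
MODEL = "qwen-placeholder"  # assumption: substitute your actual model tag

def overlap_ratio(windows):
    """Fraction of busy time with >= 2 requests in flight.

    ~0.0 means fully serialized handling; values near 1.0 mean the
    requests genuinely ran in parallel.
    """
    events = []
    for start, end in windows:
        events.append((start, 1))   # request enters service
        events.append((end, -1))    # request leaves service
    events.sort()
    active, busy, overlap = 0, 0.0, 0.0
    prev = events[0][0]
    for t, delta in events:
        if active >= 1:
            busy += t - prev
        if active >= 2:
            overlap += t - prev
        active += delta
        prev = t
    return overlap / busy if busy else 0.0

def _one_request(windows, i):
    body = json.dumps({"model": MODEL, "prompt": "Count to 50.",
                       "stream": False}).encode()
    req = urllib.request.Request(
        ENDPOINT, data=body, headers={"Content-Type": "application/json"})
    start = time.monotonic()
    urllib.request.urlopen(req).read()
    windows[i] = (start, time.monotonic())

def measure(n=4):
    """Launch n concurrent requests and report the overlap ratio."""
    windows = [None] * n
    threads = [threading.Thread(target=_one_request, args=(windows, i))
               for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return overlap_ratio(windows)
```

Calling `measure(4)` against a single instance exhibiting the reported behavior should return a ratio near 0; a concurrency-capable runtime serving requests of similar length would approach 1.0.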
- Ollama developers have indicated that the affected models have architectures that prevent parallel queries within a single Ollama server process, and that the workaround is to run multiple Ollama servers behind a reverse proxy.
- That workaround duplicates model weights in memory, reducing memory available for context and making it much less efficient operationally.
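For completeness, the workaround as we understand it looks roughly like the sketch below: one `ollama serve` process per port via the `OLLAMA_HOST` environment variable, fronted by a reverse proxy. The instance count, base port, and nginx upstream text are illustrative assumptions, not a recommended configuration:

```python
"""Sketch of the multi-instance workaround: n Ollama servers, each on
its own port, behind a reverse proxy. Every instance loads its own
copy of the model weights, which is the memory cost this report is
about.
"""
import os
import subprocess

def instance_hosts(n, base_port=11434):
    """host:port strings for n side-by-side Ollama instances."""
    return [f"127.0.0.1:{base_port + i}" for i in range(n)]

def launch_instances(hosts):
    """Start one `ollama serve` per host via the OLLAMA_HOST variable."""
    procs = []
    for host in hosts:
        env = dict(os.environ, OLLAMA_HOST=host)
        procs.append(subprocess.Popen(["ollama", "serve"], env=env))
    return procs

def nginx_upstream(hosts, name="ollama_pool"):
    """Illustrative nginx upstream block for round-robin dispatch."""
    lines = [f"upstream {name} {{"]
    lines += [f"    server {host};" for host in hosts]
    lines.append("}")
    return "\n".join(lines)
```

With two 122B instances this roughly doubles resident weight memory, which is exactly the context-capacity tradeoff described above.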
Expected behavior
Better support for true concurrent inference within a single server instance, or model architectural characteristics that allow runtime schedulers to serve multiple requests more efficiently without duplicating weights.
Why this matters
Qwen’s token speed and response quality are excellent, and that makes the model highly attractive for production use. However, in a shared multi-user environment, concurrency characteristics are just as important as raw quality and speed. Requiring multiple independent server instances to achieve concurrency is memory-expensive and reduces deployability.
Impact
- Limits efficient multi-user serving
- Forces a tradeoff between concurrency and context capacity
- Reduces the operational advantage of high-memory systems
- Makes production scaling less elegant than it could be
Request
- Are there plans to improve concurrency characteristics in future Qwen architectures, especially for shared-weight, single-server, multi-request inference scenarios?
- If there are recommended model settings, architectural notes, or future roadmap items related to concurrency-friendly serving, that information would be very helpful.
Additional note
This is not a complaint about output quality or speed. Qwen performs very well on both. This is specifically feedback about deployment behavior in real multi-user serving environments.
期望行为 | Expected Behavior
No response
复现方法 | Steps To Reproduce
No response
运行环境 | Environment
- OS:
- Python:
- Transformers:
- PyTorch:
- CUDA (`python -c 'import torch; print(torch.version.cuda)'`):
备注 | Anything else?
No response