[BUG] Poor single-server concurrency behavior in Qwen 3.5 under Ollama despite strong token speed #2155

@charlesdrakon-cmyk

Description

Is there an existing issue / discussion for this?

  • I have searched the existing issues / discussions

Is there an existing answer for this in the FAQ?

  • I have searched the FAQ

Current Behavior

Summary
We are seeing effectively serialized query handling with Qwen 3.5 models in Ollama, even on high-memory Apple Silicon systems where single-stream performance is very strong. The result is that Qwen delivers excellent per-request speed, but poor true multi-user concurrency in a single server process.

Environment

Model family: Qwen 3.5
Models tested: 35B and 122B variants
Runtime: Ollama
Backend: MLX / Apple Silicon
Deployment type: private multi-user serving environment
Front end: Open WebUI

Observed behavior

Individual query speed is excellent.
Under multiple simultaneous user requests, generation appears to serialize within a single Ollama server instance.
This creates a “pseudo-concurrency” effect: the system still feels responsive because Qwen is fast, but requests are not actually served in parallel, which is what efficient multi-user production use requires.
Ollama developers have indicated that the affected models have architectures that prevent parallel queries within a single Ollama server process, and that the workaround is to run multiple Ollama servers behind a reverse proxy.
That workaround duplicates the model weights in memory for every instance, reducing the memory available for context and making operation much less efficient.
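The serialization described above can be measured from the client side. The sketch below is a minimal probe, not part of the report: it fires N identical requests at Ollama's standard `/api/generate` endpoint concurrently and compares total wall time against the sum of per-request latencies. The endpoint URL and the model tag `qwen3.5:35b` are assumptions for illustration; substitute whatever is loaded locally.

```python
# Minimal concurrency probe for a local Ollama server (sketch, not a benchmark).
# The endpoint and model tag below are assumptions; adjust to your setup.
import concurrent.futures
import json
import time
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default Ollama endpoint
MODEL = "qwen3.5:35b"  # hypothetical tag; substitute your local model

def timed_request(prompt: str) -> float:
    """Send one non-streaming generate request; return its latency in seconds."""
    body = json.dumps({"model": MODEL, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    start = time.perf_counter()
    with urllib.request.urlopen(req) as resp:
        resp.read()
    return time.perf_counter() - start

def serialization_ratio(latencies: list[float], wall_time: float) -> float:
    """Ratio of wall-clock time to summed per-request latencies.
    ~1.0 means requests ran back to back (serialized);
    ~1/N means they were truly served in parallel."""
    return wall_time / sum(latencies)

def probe(n_clients: int = 4) -> float:
    """Fire n_clients identical prompts simultaneously; return the ratio."""
    start = time.perf_counter()
    with concurrent.futures.ThreadPoolExecutor(max_workers=n_clients) as pool:
        latencies = list(pool.map(timed_request, ["Count to twenty."] * n_clients))
    wall = time.perf_counter() - start
    return serialization_ratio(latencies, wall)
```

Against a live server, `probe(4)` returning a value near 1.0 suggests the serialized behavior reported here, while a value near 0.25 would indicate true parallel service.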

Expected behavior

Better support for true concurrent inference within a single server instance, or model architectures that allow runtime schedulers to serve multiple requests efficiently without duplicating weights.

Why this matters
Qwen’s token speed and response quality are excellent, and that makes the model highly attractive for production use. However, in a shared multi-user environment, concurrency characteristics are just as important as raw quality and speed. Requiring multiple independent server instances to achieve concurrency is memory-expensive and reduces deployability.
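For context, the multi-server workaround mentioned above might look like the following nginx sketch (ports and layout are assumptions, not a recommended configuration). It makes the memory cost concrete: each upstream is a separate `ollama serve` process, for example started with `OLLAMA_HOST=127.0.0.1:11435` and `OLLAMA_HOST=127.0.0.1:11436`, and each process loads its own full copy of the model weights.

```nginx
# Hypothetical reverse-proxy layout for the multi-instance workaround.
upstream ollama_pool {
    least_conn;                  # route new requests to the least-busy instance
    server 127.0.0.1:11435;      # instance 1: full copy of the weights
    server 127.0.0.1:11436;      # instance 2: another full copy
}

server {
    listen 11434;                # clients keep using the default Ollama port
    location / {
        proxy_pass http://ollama_pool;
        proxy_read_timeout 600s; # long generations need a generous timeout
        proxy_buffering off;     # needed for streamed token responses
    }
}
```

With two instances of a 122B model, the weight memory is paid twice, which is exactly the context-capacity tradeoff described in this report.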

Impact

Limits efficient multi-user serving
Forces a tradeoff between concurrency and context capacity
Reduces the operational advantage of high-memory systems
Makes production scaling less elegant than it could be

Request
Are there plans to improve concurrency characteristics in future Qwen architectures, especially for shared-weight, single-server, multi-request inference scenarios?

If there are recommended model settings, architectural notes, or future roadmap items related to concurrency-friendly serving, that information would be very helpful.

Additional note
This is not a complaint about output quality or speed. Qwen performs very well on both. This is specifically feedback about deployment behavior in real multi-user serving environments.

Expected Behavior

No response

Steps To Reproduce

No response

Environment

- OS:
- Python:
- Transformers:
- PyTorch:
- CUDA (`python -c 'import torch; print(torch.version.cuda)'`):

Anything else?

No response
