是否已有关于该错误的issue或讨论? | Is there an existing issue / discussion for this?
该问题是否在FAQ中有解答? | Is there an existing answer for this in FAQ?
当前行为 | Current Behavior
Summary
We are seeing effectively serialized query handling with Qwen 3.5 models in Ollama, even on high-memory Apple Silicon systems where single-stream performance is very strong. The result is that Qwen delivers excellent per-request speed, but poor true multi-user concurrency in a single server process.
Environment
- Model family: Qwen 3.5
- Models tested: 35B and 122B variants
- Runtime: Ollama
- Backend: MLX / Apple Silicon
- Deployment type: private multi-user serving environment
- Front end: Open WebUI
Observed behavior
- Individual query speed is excellent.
- Under multiple simultaneous user requests, generation appears to serialize within a single Ollama server instance.
- This creates a "pseudo-concurrency" effect: the system still feels responsive because Qwen is fast, but requests are not actually served in parallel, as efficient multi-user production use requires.
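The serialization can be made measurable with a small client-side probe. Below is a minimal sketch that fires several identical requests at once and reports how much their service windows overlap; it assumes the default Ollama endpoint on `localhost:11434`, and the model tag is a placeholder to substitute with whatever you actually pulled:

```python
"""Probe whether an Ollama server overlaps request processing.

Fires n identical generation requests at once and records each
request's (start, end) wall-clock window. If the server truly
serializes, the windows barely overlap.
"""
import json
import threading
import time
import urllib.request

ENDPOINT = "http://localhost:11434/api/generate"  # default Ollama port
MODEL = "qwen-placeholder"  # assumption: substitute your actual model tag

def overlap_ratio(windows):
    """Fraction of busy time with >= 2 requests in flight.

    ~0.0 means fully serialized handling; values near 1.0 mean the
    requests genuinely ran in parallel.
    """
    events = []
    for start, end in windows:
        events.append((start, 1))   # request enters service
        events.append((end, -1))    # request leaves service
    events.sort()
    active, busy, overlap = 0, 0.0, 0.0
    prev = events[0][0]
    for t, delta in events:
        if active >= 1:
            busy += t - prev
        if active >= 2:
            overlap += t - prev
        active += delta
        prev = t
    return overlap / busy if busy else 0.0

def _one_request(windows, i):
    body = json.dumps({"model": MODEL, "prompt": "Count to 50.",
                       "stream": False}).encode()
    req = urllib.request.Request(
        ENDPOINT, data=body, headers={"Content-Type": "application/json"})
    start = time.monotonic()
    urllib.request.urlopen(req).read()
    windows[i] = (start, time.monotonic())

def measure(n=4):
    """Launch n concurrent requests and report the overlap ratio."""
    windows = [None] * n
    threads = [threading.Thread(target=_one_request, args=(windows, i))
               for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return overlap_ratio(windows)
```

Calling `measure(4)` against a single instance exhibiting the reported behavior should return a ratio near 0; a concurrency-capable runtime serving requests of similar length would approach 1.0.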
- Ollama developers have indicated that the affected models have architectures that prevent parallel queries within a single Ollama server process, and that the workaround is to run multiple Ollama servers behind a reverse proxy.
- That workaround duplicates model weights in memory, reducing memory available for context and making it much less efficient operationally.
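For completeness, the workaround as we understand it looks roughly like the sketch below: one `ollama serve` process per port via the `OLLAMA_HOST` environment variable, fronted by a reverse proxy. The instance count, base port, and nginx upstream text are illustrative assumptions, not a recommended configuration:

```python
"""Sketch of the multi-instance workaround: n Ollama servers, each on
its own port, behind a reverse proxy. Every instance loads its own
copy of the model weights, which is the memory cost this report is
about.
"""
import os
import subprocess

def instance_hosts(n, base_port=11434):
    """host:port strings for n side-by-side Ollama instances."""
    return [f"127.0.0.1:{base_port + i}" for i in range(n)]

def launch_instances(hosts):
    """Start one `ollama serve` per host via the OLLAMA_HOST variable."""
    procs = []
    for host in hosts:
        env = dict(os.environ, OLLAMA_HOST=host)
        procs.append(subprocess.Popen(["ollama", "serve"], env=env))
    return procs

def nginx_upstream(hosts, name="ollama_pool"):
    """Illustrative nginx upstream block for round-robin dispatch."""
    lines = [f"upstream {name} {{"]
    lines += [f"    server {host};" for host in hosts]
    lines.append("}")
    return "\n".join(lines)
```

With two 122B instances this roughly doubles resident weight memory, which is exactly the context-capacity tradeoff described above.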
Expected behavior
Better support for true concurrent inference within a single server instance, or model architectural characteristics that allow runtime schedulers to serve multiple requests more efficiently without duplicating weights.
Why this matters
Qwen’s token speed and response quality are excellent, and that makes the model highly attractive for production use. However, in a shared multi-user environment, concurrency characteristics are just as important as raw quality and speed. Requiring multiple independent server instances to achieve concurrency is memory-expensive and reduces deployability.
Impact
- Limits efficient multi-user serving
- Forces a tradeoff between concurrency and context capacity
- Reduces the operational advantage of high-memory systems
- Makes production scaling less elegant than it could be
Request
- Are there plans to improve concurrency characteristics in future Qwen architectures, especially for shared-weight, single-server, multi-request inference scenarios?
- If there are recommended model settings, architectural notes, or future roadmap items related to concurrency-friendly serving, that information would be very helpful.
Additional note
This is not a complaint about output quality or speed. Qwen performs very well on both. This is specifically feedback about deployment behavior in real multi-user serving environments.
期望行为 | Expected Behavior
No response
复现方法 | Steps To Reproduce
No response
运行环境 | Environment
- OS:
- Python:
- Transformers:
- PyTorch:
- CUDA (`python -c 'import torch; print(torch.version.cuda)'`):
备注 | Anything else?
No response