Description
Which version of LM Studio?
Which operating system?
• User Hardware:
◦ OS: Win11
◦ CPU: Intel Core i7-12700
◦ RAM: 32 GB
◦ GPU: AMD Radeon RX 7900 XT
• Testing Software & Versions:
◦ LM Studio (using the ROCm backend of llamacpp 2.7.0)
◦ Ollama 0.18.0
◦ Standalone llamacpp executable (llama-b8327-bin-win-hip-radeon-x64)
What is the bug?
Bug Report
Basic Information
• Model: DeepSeek R1 Distill Qwen 14B (Known to have 48 layers)
Problem Description
There is a significant, unexplained discrepancy in generation speed (Generation t/s) when the same model is run on different inference backends. The core observation: when the GPU offload layer count (`ngl`) is set exactly to the model's total layer count (48), generation is slowest. Raising `ngl` beyond the model's layer count (e.g., 99), or even by just one (49), yields a substantial performance improvement. This strongly suggests a problem in how LM Studio sets or passes the GPU offload layer count when it invokes llamacpp.
Detailed Test Data Comparison
| Environment / Parameter Setting | Prompt Processing Speed (t/s) | Generation Speed (t/s) | Notes |
| --- | --- | --- | --- |
| Ollama 0.18.0 | Not separately listed | ~80 (total speed) | Used as a performance baseline. |
| LM Studio (llamacpp 2.7.0) | Not separately listed | ~50 (total speed) | Significantly slower than Ollama; GPU Offload set to 48. |
| Standalone llamacpp (`ngl 48`) | 106.8 | 37.2 | Key finding: setting `ngl` equal to the model's total layer count (48) yields the slowest generation speed. |
| Standalone llamacpp (`ngl 99`) | 134.0 | 57.1 | Setting `ngl` far beyond the model's layer count (99) improves generation speed by over 50%. |
| Standalone llamacpp (`ngl 49`) | 140.5 | 56.6 | Setting `ngl` just one above the layer count (49) gives a boost similar to `ngl 99`. |
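For reference, the standalone comparison above can be reproduced with llama.cpp's bundled `llama-bench` tool from the same HIP build; the model filename below is a placeholder, not the exact file used in these tests:

```shell
# Placeholder path to the DeepSeek R1 Distill Qwen 14B GGUF (48 layers).
MODEL="DeepSeek-R1-Distill-Qwen-14B.gguf"

# ngl equal to the model's layer count (the slow case reported above)
llama-bench -m "$MODEL" -ngl 48

# ngl above the layer count (the fast cases reported above)
llama-bench -m "$MODEL" -ngl 49
llama-bench -m "$MODEL" -ngl 99
```

`llama-bench` reports prompt-processing (pp) and token-generation (tg) throughput separately for each run, which is how the per-`ngl` numbers in the table can be collected.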