Linux vs Windows for LLM inference #581
VladimirVLF started this conversation in Show and tell
I got an AI-ready laptop and wanted to figure out how much AI I could squeeze out of it. Here is the answer I got.
One of the first questions to answer is: which OS runs LLM inference more efficiently, Linux or Windows?
Now that Lemonade is available on both Linux and Windows, this comparison can be run on both OSs, at least for the llamacpp backend with Vulkan (there is no ROCm for Strix Point on my laptop). The NPU is not yet fully supported on Linux (support is coming, see Ryzen AI SW v1.6.1), so the NPU tests will have to wait.
To answer my question, I installed Ubuntu 24.04 and Windows 11 on my laptop and created an NTFS partition shared between the two OSs, so the LLM models are stored once and used from both. NOTE: In this configuration, download the models from Windows, since downloads made from Linux can create symlinks that will likely not work as intended in Windows.
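If some models were already downloaded from Linux, a quick way to see whether the shared partition contains symlinks is to scan it. This is a minimal sketch, not part of Lemonade: the function name `find_symlinks` and the example mount point are my own placeholders.

```python
import os
from pathlib import Path


def find_symlinks(root):
    """Return every symlink under `root`, sorted, without following any of them."""
    found = []
    # followlinks=False: symlinked directories are reported but not descended into.
    for dirpath, dirnames, filenames in os.walk(root, followlinks=False):
        for name in dirnames + filenames:
            path = Path(dirpath) / name
            if path.is_symlink():
                found.append(path)
    return sorted(found)


# Example (hypothetical mount point of the shared NTFS partition):
#   for link in find_symlinks("/mnt/models"):
#       print(link, "->", os.readlink(link))
```

An empty result means nothing on the partition depends on Linux symlinks; any path it does list is a candidate for re-downloading from Windows.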
Parameters
Some of them are not very relevant:
I tested 3 LLMs:
The two parameters I looked at are RAM usage (together with the context window size it allows) and tokens per second (TPS).
Results: RAM and context window size
On Windows there is some memory overhead, so Qwen3-Coder could only be used with a 64k context window: with 128k the model failed to load because there was not enough RAM. The other models could be used with 128k.
On Linux I didn't notice this kind of memory overhead: what the GPU used was pretty much what showed up in total RAM. So all models were used with a 128k context window.
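To get a feeling for why going from 64k to 128k can tip a model over the RAM limit, here is a rough sketch of the usual KV-cache size estimate for a transformer with standard attention. The layer count, KV head count and head size below are made-up illustration values, not the real dimensions of any model I tested, and the sketch assumes an unquantized f16 cache (2 bytes per element).

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_tokens, bytes_per_element=2):
    """Estimate KV-cache size in bytes for a given context length.

    The leading factor 2 accounts for storing both a K and a V tensor per layer.
    """
    return 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_element


# Hypothetical model dimensions, for illustration only.
LAYERS, KV_HEADS, HEAD_DIM = 48, 4, 128

for ctx in (64 * 1024, 128 * 1024):
    gib = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, ctx) / 1024**3
    print(f"{ctx // 1024}k context: {gib:.1f} GiB of KV cache")
```

The estimate grows linearly with the context size, so doubling the window doubles the cache. With a few GB of extra OS overhead on top, that can be the difference between a model loading and not loading.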
Memory in GB, format is GPU RAM(Total RAM)-context size:
Results: Tokens Per Second (TPS)
TPS results: