Linux vs Windows for LLM inference #581

VladimirVLF · 2025-11-18T13:15:33Z

I got an AI-ready laptop and wanted to find out how much AI performance I could squeeze out of it. Here are the results I got.

One of the first questions to answer is: on which OS is LLM inference more efficient, Linux or Windows?

Now that Lemonade is available on both Linux and Windows, this comparison can be run on both OSs, at least for the llamacpp backend with Vulkan (there is no ROCm for Strix Point on my laptop). The NPU is not yet fully supported on Linux (support is coming, see Ryzen AI SW v1.6.1), so the NPU tests will have to wait.

To answer my question, I installed Ubuntu 24.04 and Windows 11 on my laptop and created an NTFS partition shared between the two OSs, so that the LLM models are stored once and used from both. NOTE: In this configuration, download the models from Windows, since a download made on Linux can use symlinks that will likely not work as intended in Windows.
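If models were already downloaded from Linux, one way to check for the symlink issue is to scan the shared partition for symlinks before booting into Windows. This is only a minimal sketch: the function name `find_symlinks` is made up for illustration, and the directory to scan is whatever path your shared NTFS partition is mounted at.

```python
import os


def find_symlinks(root):
    """Return a sorted list of paths under root that are symbolic links."""
    links = []
    # followlinks=False: report symlinked directories without descending into them
    for dirpath, dirnames, filenames in os.walk(root, followlinks=False):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                links.append(path)
    return sorted(links)
```

Calling `find_symlinks` on the mount point of the shared partition returns every link found; an empty list suggests the model files are plain files that both OSs should be able to read.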

Parameters

Some of these may not be very relevant:

Chip: AMD Ryzen AI 7 Pro 360 (Strix Point) with iGPU Radeon 880M
RAM: 64 GB shared, with max 32 GB available to iGPU, unless tweaked
Windows 11 Pro
- 25H2 26200.6899
- GPU driver: 32.0.22021.1009
- NPU driver: 32.0.203.311
Ubuntu 24.04.3
- Kernel: 6.14.0
- XRT: 2.20.0
- amdxdna: 2.20.0_20250707
Lemonade v9.0.2
- llamacpp backend with Vulkan

I tested 3 LLMs:

Qwen3-8B-GGUF
Qwen3-Coder-30B-A3B-Instruct-GGUF
GPT-OSS-20B-MXFP4-GGUF

The two parameters I looked at are:

Memory consumption and maximum usable context window
Inference speed in tokens per second (TPS)

Results: RAM and context window size

On Windows there is some memory overhead, so Qwen3-Coder could only be used with a 64k context window: the model failed to load with 128k because there was not enough RAM. The other models could be used with 128k.

On Linux I didn't notice this kind of memory overhead: what the GPU used was pretty much what showed up in total RAM usage. So all models were used with a 128k context window.

Memory in GB, format is GPU RAM(Total RAM)-context size:

                  Windows 11      Ubuntu 24.04
Qwen3-8B          25(28)-128k     25(26)-128k
Qwen3-Coder-30B   25(41)-64k      31(32)-128k
GPT-OSS-20B       16(25)-128k     16(16)-128k
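To make the overhead visible, the difference between total RAM and GPU RAM can be computed for each cell of the table above. The sketch below only does that subtraction on the numbers as posted; the names `memory_gb` and `overhead_gb` are illustrative, and keep in mind that the Windows figure for Qwen3-Coder-30B is at 64k context while the Ubuntu one is at 128k.

```python
# (GPU RAM in GB, total RAM in GB), copied from the table above
memory_gb = {
    "Qwen3-8B":        {"Windows 11": (25, 28), "Ubuntu 24.04": (25, 26)},
    "Qwen3-Coder-30B": {"Windows 11": (25, 41), "Ubuntu 24.04": (31, 32)},
    "GPT-OSS-20B":     {"Windows 11": (16, 25), "Ubuntu 24.04": (16, 16)},
}

# RAM used beyond what the GPU holds, per model and OS
overhead_gb = {
    model: {os_name: total - gpu for os_name, (gpu, total) in per_os.items()}
    for model, per_os in memory_gb.items()
}

for model, per_os in overhead_gb.items():
    print(f"{model:16s} Windows: {per_os['Windows 11']:2d} GB   "
          f"Ubuntu: {per_os['Ubuntu 24.04']:2d} GB")
```

With these numbers the extra RAM beyond GPU memory comes out at 3, 16 and 9 GB on Windows versus 1, 1 and 0 GB on Ubuntu.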

Results: Tokens Per Second (TPS)

TPS results:

                  Windows 11     Ubuntu 24.04
Qwen3-8B             12.9            16.2
Qwen3-Coder-30B      22.8            34.2
GPT-OSS-20B          14.8            27.8
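The relative difference is easier to read as a ratio. The short sketch below divides the Ubuntu TPS by the Windows TPS for each model, using only the numbers from the table above; the names `tps` and `speedup` are illustrative.

```python
# (Windows 11 TPS, Ubuntu 24.04 TPS), copied from the table above
tps = {
    "Qwen3-8B":        (12.9, 16.2),
    "Qwen3-Coder-30B": (22.8, 34.2),
    "GPT-OSS-20B":     (14.8, 27.8),
}

# Ratio of Ubuntu TPS to Windows TPS for each model
speedup = {model: ubuntu / windows for model, (windows, ubuntu) in tps.items()}

for model, ratio in speedup.items():
    print(f"{model:16s} Ubuntu/Windows = {ratio:.2f}x")
```

On these figures Ubuntu comes out roughly 1.26x, 1.50x and 1.88x faster than Windows for the three models, respectively.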
