Win build improvements by aittalam · Pull Request #940 · mozilla-ai/llamafile

aittalam · 2026-04-09T12:49:46Z

A WIP PR with improvements to Windows support for GPU acceleration:

Simpler .bat build script (with auto-detection of VS tooling)
Fixes to build with CUDA13
Fixes to ggml-vulkan (shader compilation now runs single-threaded to avoid breakage with cosmocc on Windows)
Low-level logging is hidden unless --verbose is passed; logging calls made more consistent

aittalam · 2026-04-09T13:04:22Z

@wingenlit I think I have been able to replicate your issue (the Vulkan build succeeded, the GPU was found, and the model loaded, but the process core-dumped as soon as inference was called). I left some notes here (I investigated the issue with Claude, so while the methodology was relatively sound, delving deeper and deeper into the error with debugging code, there might be some mistakes in how it explains things).

Vulkan now seems to be working (tested with llama-benchy), although pp2048 is considerably slower than with CUDA (about 631 t/s vs. 1128 t/s).

uvx llama-benchy --base-url http://localhost:8080 --model Qwen3.5-9B_Q5_K_S --tokenizer Qwen/Qwen3.5-9B --runs 10

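| model                    | test   | t/s              | peak t/s     | ttfr (ms)        | est_ppt (ms)     | e2e_ttft (ms)    |
|--------------------------|--------|------------------|--------------|------------------|------------------|------------------|
| Qwen3.5-9B_Q5_K_S Vulkan | pp2048 | 630.55 ± 42.61   |              | 3265.10 ± 206.56 | 3263.33 ± 206.56 | 3265.15 ± 206.56 |
| Qwen3.5-9B_Q5_K_S Vulkan | tg32   | 48.27 ± 2.19     | 50.12 ± 2.70 |                  |                  |                  |
| Qwen3.5-9B_Q5_K_S Cuda   | pp2048 | 1127.93 ± 101.46 |              | 1832.72 ± 162.31 | 1831.07 ± 162.31 | 1832.77 ± 162.30 |
| Qwen3.5-9B_Q5_K_S Cuda   | tg32   | 35.80 ± 0.61     | 37.01 ± 0.61 |                  |                  |                  |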
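
To put the numbers above in perspective, here is a small sketch that computes the throughput ratios and the time a prompt would take at the mean throughput. It assumes that pp2048 denotes a 2048-token prompt and tg32 denotes 32 generated tokens, which the benchmark output does not state explicitly; the helper names are illustrative and not part of llama-benchy.

```cpp
#include <cassert>
#include <cmath>

// Ratio of two throughput figures in tokens/second; a result > 1 means `a` is faster.
inline double throughput_ratio(double a_tps, double b_tps) {
    assert(b_tps > 0.0);
    return a_tps / b_tps;
}

// Milliseconds needed to process `tokens` tokens at `tps` tokens/second.
// Assumption: the whole prompt is processed at the reported mean rate.
inline double processing_time_ms(double tokens, double tps) {
    assert(tps > 0.0);
    return 1000.0 * tokens / tps;
}
```

With the reported means, `throughput_ratio(1127.93, 630.55)` is about 1.79 (CUDA ahead on pp2048), while `throughput_ratio(48.27, 35.80)` is about 1.35 (Vulkan ahead on tg32). `processing_time_ms(2048, 630.55)` gives roughly 3248 ms, close to but not identical with the reported est_ppt of 3263.33 ms; a small gap is expected if the tool averages per-run times rather than dividing by the mean throughput.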

aittalam · 2026-04-17T11:01:51Z

Code review

Found 1 issue:

The new synchronous shader-compile path in ggml_vk_load_shaders removes compile_count++ before calling ggml_vk_create_pipeline_func, but the callee still runs assert(compile_count > 0); compile_count--; inside its lock. In debug builds the assert fires on the very first call; in release builds the uint32_t decrement underflows from 0 to UINT32_MAX, poisoning the counter for any future use. Either keep the compile_count++ in the caller, or (consistent with the deleted // TODO: We're no longer benefitting from the async compiles ... this complexity can be removed.) also patch ggml_vk_create_pipeline_func to drop the counter/mutex/cond-var bookkeeping.

llamafile/llama.cpp.patches/patches/ggml_src_ggml-vulkan_ggml-vulkan.cpp.patch

Lines 30 to 52 in ec56f41

    
    @@ -3391,20 +3391,9 @@ static void ggml_vk_load_shaders(vk_device& device) {
                 if (!pipeline->needed || pipeline->compiled) {
                     continue;
                 }
    -            // TODO: We're no longer benefitting from the async compiles (shaders are
    -            // compiled individually, as needed) and this complexity can be removed.
    -            {
    -                // wait until fewer than N compiles are in progress
    -                uint32_t N = std::max(1u, std::thread::hardware_concurrency());
    -                std::unique_lock<std::mutex> guard(compile_count_mutex);
    -                while (compile_count >= N) {
    -                    compile_count_cond.wait(guard);
    -                }
    -                compile_count++;
    -            }
    -
    -            compiles.push_back(std::async(ggml_vk_create_pipeline_func, std::ref(device), std::ref(pipeline), spv_size, spv_data, entrypoint,
    -                                          parameter_count, wg_denoms, specialization_constants, disable_robustness, require_full_subgroups, required_subgroup_size));
    +            // Compile synchronously to avoid threading issues in cross-module DLL loading
    +            ggml_vk_create_pipeline_func(device, pipeline, spv_size, spv_data, entrypoint,
    +                                         parameter_count, wg_denoms, specialization_constants, disable_robustness, require_full_subgroups, required_subgroup_size);
             }
         };
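
The counter mismatch described in this review can be reproduced in isolation. The sketch below is a simplified stand-in, not the actual ggml-vulkan code: `CompileCounter` and both helper functions are hypothetical names, and the callee's `assert` is left out to mimic a release build.

```cpp
#include <cassert>
#include <cstdint>
#include <mutex>

// Hypothetical stand-in for the compile_count / compile_count_mutex pair.
struct CompileCounter {
    std::mutex mutex;
    uint32_t count = 0;

    // Caller side: register one in-flight compile (the removed compile_count++).
    void begin() {
        std::lock_guard<std::mutex> guard(mutex);
        ++count;
    }

    // Callee side: what the decrement does once assert() is compiled out.
    void end_unchecked() {
        std::lock_guard<std::mutex> guard(mutex);
        --count;  // unsigned arithmetic wraps: 0 becomes UINT32_MAX
    }

    uint32_t value() {
        std::lock_guard<std::mutex> guard(mutex);
        return count;
    }
};

// Synchronous call without the matching increment: the counter is poisoned.
inline uint32_t unbalanced_sync_compile() {
    CompileCounter c;
    c.end_unchecked();
    return c.value();
}

// Synchronous call with the increment kept in the caller: the counter returns to 0.
inline uint32_t balanced_sync_compile() {
    CompileCounter c;
    c.begin();
    c.end_unchecked();
    return c.value();
}
```

Here `unbalanced_sync_compile()` returns `UINT32_MAX` and `balanced_sync_compile()` returns `0`. Keeping the increment in the caller is the smaller of the two fixes suggested above; removing the counter and mutex from the callee altogether would also work for a purely synchronous path, but touches more of the upstream code.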

🤖 Generated with Claude Code

If this code review was useful, please react with 👍. Otherwise, react with 👎.

aittalam · 2026-04-17T11:33:38Z

Code review

Found 1 issue:

Addressed in 86e8378

aittalam added 2 commits April 8, 2026 19:50

CUDA13 fixes, find vcvarsall.bat automatically

debd8a3

Got to a working (albeit slow) vulkan version on CUDA13

ace7f51

github-actions Bot added the llamafile label Apr 9, 2026

aittalam force-pushed the win_build_improvements branch from d25caed to ace7f51 Compare April 10, 2026 12:39

aittalam added 2 commits April 10, 2026 21:37

Merge with main

21f3378

Re-added synchronous compilation after merge

53d5c73

aittalam marked this pull request as ready for review April 17, 2026 08:07

aittalam added 3 commits April 17, 2026 09:27

Merge branch 'main' into win_build_improvements

daad749

Cosmetic fix to patch

84290ca

Rewrote info logs for consistency / to work with --verbose

ec56f41

Brought compile_count back in

86e8378

Updated docs removing ref to VS Developer Command Prompt

25b835e

github-actions Bot added the documentation label Apr 17, 2026

aittalam merged commit f643524 into main Apr 17, 2026
3 checks passed

aittalam deleted the win_build_improvements branch April 17, 2026 12:49

This was referenced Apr 17, 2026

Support CUDA13 build on Windows #937

Closed

Bug: Vulkan support on Windows does not work consistently #938

Open

aittalam merged 9 commits into main from win_build_improvements
