diffusion-cpp: accelerate Flux2 and Stable Diffusion Metal inference #2044
Merged
gianni-cor merged 23 commits on May 22, 2026
Co-authored-by: Cursor <cursoragent@cursor.com>
Point the diffusion ggml overlay at the PR commit with the RoPE Flux and conv2d direct safety fixes. Co-authored-by: Cursor <cursoragent@cursor.com>
Resolve diffusion-cpp conflicts by keeping PR2044's Metal conditioner behavior while carrying forward upstream integration-test cleanup. Co-authored-by: Cursor <cursoragent@cursor.com>
Default diffusion and VAE direct convolution paths to enabled while preserving explicit false overrides. Co-authored-by: Cursor <cursoragent@cursor.com>
Point the diffusion-cpp ggml overlay at the latest qvac-ext-ggml PR commit so PR2044 picks up the newest Metal RoPE dispatch guard. Co-authored-by: Cursor <cursoragent@cursor.com>
aegioscy previously approved these changes on May 20, 2026
Consume the merged ggml and stable-diffusion-cpp registry pins for the Flux optimization stack and remove the temporary in-package overlay ports. Co-authored-by: Cursor <cursoragent@cursor.com>
Mark the merged registry-port migration as the 0.9.0 diffusion-cpp release. Co-authored-by: Cursor <cursoragent@cursor.com>
Mobile integration tests, @qvac/diffusion-cpp (iOS). Result: passed
Mobile integration tests, @qvac/diffusion-cpp (Android). Result: passed
jpgaribotti approved these changes on May 21, 2026
aegioscy approved these changes on May 22, 2026
gabrielgrigoras-serv approved these changes on May 22, 2026
What problem does this PR solve?
- diffusion-cpp on Apple Silicon: up to 7.27x faster total generation in txt2img and 6.05x faster total generation in i2i on M3 Ultra.
- vae_conv_direct=true: VAE decode buffer drops by 58% versus the generic conv path.
- ggml and stable-diffusion.cpp branches before those dependencies are promoted through the registry.

How does it solve it?
- ggml and stable-diffusion-cpp under packages/diffusion-cpp.
- ggml at the branch under review in tetherto/qvac-ext-ggml#9, which includes fused Flux RoPE/permute, Metal conv2d direct, and related Metal fixes.
- stable-diffusion-cpp at the matching Flux2 integration branch so the addon actually uses the optimized Flux2 graph paths.
- keepClipOnCpu config on Apple instead of forcing the conditioner to CPU, which is the main conditioner speedup source.

Dependency PRs
- ggml: tetherto/qvac-ext-ggml#9 - Metal fused Flux RoPE, permute, conv2d direct, and related fixes.
- stable-diffusion.cpp: tetherto/qvac-ext-stable-diffusion.cpp#5 - Flux RoPE integration and optimized Flux2 graph path usage.

Benchmark setup
- qvac-dev-mac-arm64
- flux-2-klein-4b-Q8_0.gguf
- Qwen3-4B-Q4_K_M.gguf
- diffusion_pytorch_model.safetensors
- 2423.53generations after one model load
- device: gpu, fa: true, diffusion_fa: true, diffusion_conv_direct: true, threads: 4
- qwen3 = conditioner, flux = denoise, first vae = encode, second vae = decode

Recap: speed and memory impact
[Summary table not recoverable: eight rows compared main -> PR2044 and two rows compared vae_conv_direct=false -> true.]

Why these improve:
- vae_conv_direct=true, because the direct conv2d path avoids the large im2col-style intermediate buffers used by the generic path.
- vae_conv_direct affects VAE buffers/time only; conditioner and denoise buffers stay unchanged in the benchmark.

Text-to-image: main vs PR2044, vae_conv_direct=true
Modality: text-to-image (txt2img) with no init image.
[Table not recoverable: rows alternated between main and PR2044.]

Image-to-image: main vs PR2044, vae_conv_direct=true
Modality: FLUX2 image-to-image through in-context conditioning via single init_image/ref_images, not SDEdit-style init-image noising.
[Table not recoverable: rows alternated between main and PR2044.]

Image-to-image: PR2044, vae_conv_direct toggle
[Table not recoverable: rows alternated between vae_conv_direct true and false.]

Stable Diffusion models: main vs PR2044
Modality: text-to-image (txt2img) using the existing SD examples. SD2.1 uses stable-diffusion-v2-1-Q8_0.gguf, 768x768, 5 steps, cfg_scale: 7.5, prediction: v. SD3 Medium uses sd3_medium_incl_clips.safetensors, 512x512, 28 steps, cfg_scale: 5.0, sampling_method: euler, prediction: flow, flow_shift: 3.0. SDXL uses stable-diffusion-xl-base-1.0-Q4_0.gguf, 512x512, 5 steps, cfg_scale: 6.5.
[Table not recoverable: rows were 768x768, 5 steps (main, PR2044); 512x512, 28 steps (main, PR2044); 512x512, 5 steps (main, PR2044).]

Output Quality Sanity Check
Settings: one txt2img generation per branch using the same prompt, seed, model files, and quality-oriented params. Outputs are saved under quality-images/; side-by-side contact sheets are in quality-images/contact/.
- 512x512, 12 steps, guidance: 3.5, cfg_scale: 1.0, prediction: flux2_flow, seed 42
- 768x768, 20 steps, cfg_scale: 7.5, prediction: v, seed 42
- 512x512, 28 steps, cfg_scale: 5.0, sampling_method: euler, prediction: flow, flow_shift: 3.0, seed 42
- 512x512, 30 steps, cfg_scale: 6.5, seed 15
- 512x512; this checks regression/artifacts, not full native SDXL 1024 quality.

Breaking changes
None.
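
Note on the vae_conv_direct memory reduction described under "Why these improve": the sketch below gives a rough sense of why an im2col-style conv2d needs a large intermediate buffer. It estimates the size of the unfolded matrix for one layer and compares it with the layer's input. The layer shape (128 channels, 3x3 kernel, 512x512 feature map, 2 bytes per element) is a hypothetical example, not a shape taken from the benchmark, and the function names are illustrative rather than part of ggml or stable-diffusion.cpp.

```python
def conv_out_size(size, kernel, stride=1, padding=0, dilation=1):
    """Standard convolution output-size formula for one spatial dimension."""
    return (size + 2 * padding - dilation * (kernel - 1) - 1) // stride + 1


def im2col_bytes(c_in, h, w, kh, kw, stride=1, padding=0, bytes_per_elem=2):
    """Size of the unfolded (im2col) matrix.

    Each output pixel gets its own column holding the c_in * kh * kw input
    values under the kernel window, so the buffer grows by roughly kh * kw
    relative to the input when stride is 1.
    """
    h_out = conv_out_size(h, kh, stride, padding)
    w_out = conv_out_size(w, kw, stride, padding)
    return c_in * kh * kw * h_out * w_out * bytes_per_elem


def tensor_bytes(channels, h, w, bytes_per_elem=2):
    """Size of a plain channels x h x w activation tensor."""
    return channels * h * w * bytes_per_elem


# Hypothetical VAE-like layer: 128 input channels, 3x3 kernel, "same" padding.
unfolded = im2col_bytes(c_in=128, h=512, w=512, kh=3, kw=3, padding=1)
plain_input = tensor_bytes(channels=128, h=512, w=512)

print(f"im2col buffer: {unfolded / 2**20:.0f} MiB")    # 576 MiB
print(f"input tensor:  {plain_input / 2**20:.0f} MiB")  # 64 MiB
print(f"ratio:         {unfolded // plain_input}x")     # 9x
```

A direct convolution reads the input windows in place and never materializes this unfolded matrix, which is consistent with the VAE buffer reduction reported above. The 58% figure itself comes from the benchmark, not from this estimate.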