
diffusion-cpp: accelerate Flux2 and Stable Diffusion Metal inference #2044

Merged
gianni-cor merged 23 commits into tetherto:main from gianni-cor:feat/ggml-fused-rope-flux2
May 22, 2026

@gianni-cor gianni-cor commented May 13, 2026

What problem does this PR solve?

  • Delivers a large Flux2 performance improvement for diffusion-cpp on Apple Silicon: up to 7.27x faster total generation in txt2img and 6.05x faster total generation in i2i on M3 Ultra.
  • Cuts major phase costs: conditioner is ~33x faster, denoise per step is up to 2.21x faster, and VAE decode is ~8.5x faster in the measured Flux2 runs.
  • Reduces VAE compute-buffer memory when vae_conv_direct=true: VAE decode buffer drops by 58% versus the generic conv path.
  • Validates these gains through local vcpkg overlays for the optimized ggml and stable-diffusion.cpp branches before those dependencies are promoted through the registry.

How does it solve it?

  • Adds local vcpkg overlay ports for ggml and stable-diffusion-cpp under packages/diffusion-cpp.
  • Points ggml at the branch under review in tetherto/qvac-ext-ggml#9, which includes fused Flux RoPE/permute, Metal conv2d direct, and related Metal fixes.
  • Points stable-diffusion-cpp at the matching Flux2 integration branch so the addon actually uses the optimized Flux2 graph paths.
  • Honors the addon keepClipOnCpu config on Apple instead of forcing the conditioner to CPU, which is the main conditioner speedup source.
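
The option handling described above, together with the later commit that defaults the direct convolution paths to enabled, can be sketched roughly as follows. This is an illustration of the logic only: the addon is not written in Python, `resolve_flags` is a hypothetical helper, and only the option names (`keepClipOnCpu`, `diffusion_conv_direct`, `vae_conv_direct`) come from this PR.

```python
# Hypothetical helper for illustration; the real addon code and its
# internal names may differ. Only the option names come from the PR text.
def resolve_flags(user_config: dict) -> dict:
    """Resolve effective flags from a user-supplied config dict."""
    return {
        # Direct-convolution paths default to enabled, but an explicit
        # False from the user is preserved.
        "diffusion_conv_direct": user_config.get("diffusion_conv_direct", True),
        "vae_conv_direct": user_config.get("vae_conv_direct", True),
        # The user's keepClipOnCpu value is passed through unchanged rather
        # than being overridden on Apple; None means "not specified".
        "keepClipOnCpu": user_config.get("keepClipOnCpu"),
    }


if __name__ == "__main__":
    print(resolve_flags({"keepClipOnCpu": False}))
    print(resolve_flags({"vae_conv_direct": False}))
```

The key point is that a missing key and an explicit `False` are treated differently, which is why the sketch uses `dict.get` with a default instead of a truthiness check.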

Dependency PRs

Benchmark setup

| setting | value |
| --- | --- |
| Host | qvac-dev-mac-arm64 |
| Hardware | Apple M3 Ultra |
| Diffusion model | flux-2-klein-4b-Q8_0.gguf |
| Text/LLM model | Qwen3-4B-Q4_K_M.gguf |
| VAE | diffusion_pytorch_model.safetensors |
| Steps | 2 |
| Seed | 42 |
| Guidance | 3.5 |
| Runs | 3 generations after one model load |
| Common flags | device: gpu, fa: true, diffusion_fa: true, diffusion_conv_direct: true, threads: 4 |
| Memory metric | sd.cpp/ggml compute-buffer size logs: qwen3 = conditioner, flux = denoise, first vae = encode, second vae = decode |
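
As a rough sketch of how the memory metric maps log entries to phases, the snippet below labels already-parsed `(component, size_mb)` pairs. The function name, the input shape, and the rule that a lone `vae` entry means decode (as in txt2img, which has no init image) are assumptions for illustration; the actual log format is not shown in this PR and is not parsed here.

```python
def label_buffers(entries):
    """Map (component, size_mb) pairs, in log order, to pipeline phases.

    Assumed rule: with two vae entries the first is encode and the second
    is decode; with a single vae entry it is treated as decode.
    """
    phases = {}
    vae_sizes = []
    for component, size_mb in entries:
        if component == "qwen3":
            phases["conditioner"] = size_mb
        elif component == "flux":
            phases["denoise"] = size_mb
        elif component == "vae":
            vae_sizes.append(size_mb)
    if len(vae_sizes) == 2:
        phases["vae_encode"], phases["vae_decode"] = vae_sizes
    elif len(vae_sizes) == 1:
        phases["vae_decode"] = vae_sizes[0]
    return phases


if __name__ == "__main__":
    # Sizes taken from the 512 i2i PR2044 row; the entry order is assumed.
    i2i = [("qwen3", 75.0), ("vae", 387.5), ("flux", 634.44), ("vae", 704.5)]
    print(label_buffers(i2i))
```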

Recap: speed and memory impact

| result | 512 | 1024 |
| --- | --- | --- |
| txt2img total speedup, main -> PR2044 | 30.31s -> 4.17s, 7.27x | 66.65s -> 13.40s, 4.97x |
| txt2img conditioner speedup, main -> PR2044 | 16.84s -> 0.51s, 33.0x | 16.83s -> 0.51s, 33.0x |
| txt2img denoise/step speedup, main -> PR2044 | 3.11s -> 1.41s, 2.21x | 9.78s -> 4.68s, 2.09x |
| txt2img VAE decode speedup, main -> PR2044 | 7.25s -> 0.86s, 8.43x | 30.26s -> 3.53s, 8.57x |
| i2i total speedup, main -> PR2044 | 24.09s -> 3.98s, 6.05x | 71.54s -> 15.81s, 4.52x |
| i2i conditioner speedup, main -> PR2044 | 8.41s -> 0.25s, 33.6x | 8.41s -> 0.25s, 33.6x |
| i2i denoise/step speedup, main -> PR2044 | 2.64s -> 1.22s, 2.16x | 9.93s -> 5.10s, 1.95x |
| i2i VAE decode speedup, main -> PR2044 | 7.26s -> 0.86s, 8.44x | 30.29s -> 3.55s, 8.53x |
| PR2044 VAE encode buffer, vae_conv_direct=false -> true | 848.5 MB -> 387.5 MB, -54% | 3394 MB -> 1549 MB, -54% |
| PR2044 VAE decode buffer, vae_conv_direct=false -> true | 1664.5 MB -> 704.5 MB, -58% | 6658 MB -> 2818 MB, -58% |
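
For readers who want to re-derive the ratios in the recap, a minimal sketch of the arithmetic is shown below. The helper names are illustrative; the inputs are the 512 figures from the recap.

```python
def speedup(before_s: float, after_s: float) -> float:
    """Speedup factor: how many times faster 'after' is than 'before'."""
    return before_s / after_s


def reduction_pct(before_mb: float, after_mb: float) -> float:
    """Percentage reduction from 'before' to 'after'."""
    return (before_mb - after_mb) / before_mb * 100.0


if __name__ == "__main__":
    # txt2img total at 512: 30.31s -> 4.17s
    print(f"{speedup(30.31, 4.17):.2f}x")            # 7.27x
    # VAE decode buffer at 512: 1664.5 MB -> 704.5 MB
    print(f"-{reduction_pct(1664.5, 704.5):.0f}%")   # -58%
```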

Why these improve:

  • Conditioner time drops because the addon no longer forces the Flux2 text/LLM conditioner to CPU on Apple; PR2044 keeps it on Metal.
  • Denoising gets faster from the optimized ggml/stable-diffusion.cpp Flux2 Metal path, including fused RoPE/permute and reduced copy overhead.
  • VAE gets faster and uses much less compute-buffer memory when vae_conv_direct=true, because the direct conv2d path avoids the large im2col-style intermediate buffers used by the generic path.
  • Toggling vae_conv_direct affects VAE buffers/time only; conditioner and denoise buffers stay unchanged in the benchmark.
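
To give a sense of why an im2col-style intermediate is large, the sketch below estimates its size for a single convolution layer using the textbook im2col layout (one row per output pixel, one column per input-channel and kernel-tap pair). The layer shape and the 2-byte element size are hypothetical and are not taken from the Flux2 VAE or from ggml internals, so the result is not expected to reproduce the measured buffer sizes.

```python
def im2col_bytes(out_h: int, out_w: int, in_channels: int,
                 kernel_h: int, kernel_w: int,
                 bytes_per_elem: int = 2, batch: int = 1) -> int:
    """Size of the unrolled patch matrix for one convolution.

    Rows: batch * out_h * out_w (one per output pixel).
    Columns: in_channels * kernel_h * kernel_w (one per input tap).
    """
    rows = batch * out_h * out_w
    cols = in_channels * kernel_h * kernel_w
    return rows * cols * bytes_per_elem


if __name__ == "__main__":
    # Hypothetical layer: 3x3 kernel, 128 input channels, 512x512 output.
    size = im2col_bytes(512, 512, 128, 3, 3)
    print(f"{size / 2**20:.1f} MiB")  # 576.0 MiB
```

For a 3x3 kernel with stride 1, this matrix is nine times the size of the input activation it was built from, which is the overhead a direct conv2d kernel avoids by reading the input in place.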

Text-to-image: main vs PR2044, vae_conv_direct=true

Modality: text-to-image (txt2img) with no init image.

| size | branch | total avg | conditioner avg | conditioner buffer | denoise avg | denoise / step | denoise buffer | VAE decode avg | VAE decode buffer |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 512 | main | 30.31s | 16.84s | 74 MB RAM | 6.22s | 3.11s | 368.94 MB VRAM | 7.25s | 704.5 MB VRAM |
| 512 | PR2044 | 4.17s | 0.51s | 75 MB VRAM | 2.81s | 1.41s | 386.94 MB VRAM | 0.86s | 704.5 MB VRAM |
| 1024 | main | 66.65s | 16.83s | 74 MB RAM | 19.56s | 9.78s | 1105.44 MB VRAM | 30.26s | 2818 MB VRAM |
| 1024 | PR2044 | 13.40s | 0.51s | 75 MB VRAM | 9.35s | 4.68s | 1078.44 MB VRAM | 3.53s | 2818 MB VRAM |
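
As a quick consistency check on the txt2img figures, the sketch below compares each reported total with the sum of its three phases. The values are copied from the txt2img rows; the 0.02s tolerance is an assumption to absorb rounding in the reported averages.

```python
# (size, branch) -> (total, conditioner, denoise, vae_decode), in seconds.
TXT2IMG = {
    ("512", "main"): (30.31, 16.84, 6.22, 7.25),
    ("512", "PR2044"): (4.17, 0.51, 2.81, 0.86),
    ("1024", "main"): (66.65, 16.83, 19.56, 30.26),
    ("1024", "PR2044"): (13.40, 0.51, 9.35, 3.53),
}


def phase_residual(total, conditioner, denoise, vae_decode):
    """Time in the total that is not explained by the three phases."""
    return total - (conditioner + denoise + vae_decode)


if __name__ == "__main__":
    for (size, branch), row in TXT2IMG.items():
        ok = abs(phase_residual(*row)) <= 0.02 + 1e-9
        print(size, branch, "phases match total" if ok else "mismatch")
```

The same check would not balance for the image-to-image rows, because those runs also include a VAE encode step whose time is not reported separately.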

Image-to-image: main vs PR2044, vae_conv_direct=true

Modality: FLUX2 image-to-image through in-context conditioning via a single init_image / ref_images, not SDEdit-style init-image noising.

| size | branch | total avg | conditioner avg | conditioner buffer | denoise avg | denoise / step | denoise buffer | VAE decode avg | VAE encode buffer | VAE decode buffer |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 512 | main | 24.09s | 8.41s | 74 MB RAM | 5.28s | 2.64s | 604.44 MB VRAM | 7.26s | 387.5 MB VRAM | 704.5 MB VRAM |
| 512 | PR2044 | 3.98s | 0.25s | 75 MB VRAM | 2.43s | 1.22s | 634.44 MB VRAM | 0.86s | 387.5 MB VRAM | 704.5 MB VRAM |
| 1024 | main | 71.54s | 8.41s | 74 MB RAM | 19.86s | 9.93s | 2047.44 MB VRAM | 30.29s | 1549 MB VRAM | 2818 MB VRAM |
| 1024 | PR2044 | 15.81s | 0.25s | 75 MB VRAM | 10.20s | 5.10s | 1996.44 MB VRAM | 3.55s | 1549 MB VRAM | 2818 MB VRAM |

Image-to-image: PR2044, vae_conv_direct toggle

| size | vae_conv_direct | total avg | conditioner avg | conditioner buffer | denoise avg | denoise / step | denoise buffer | VAE decode avg | VAE encode buffer | VAE decode buffer |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 512 | true | 3.98s | 0.25s | 75 MB VRAM | 2.43s | 1.22s | 634.44 MB VRAM | 0.86s | 387.5 MB VRAM | 704.5 MB VRAM |
| 512 | false | 5.04s | 0.25s | 75 MB VRAM | 2.43s | 1.22s | 634.44 MB VRAM | 1.60s | 848.5 MB VRAM | 1664.5 MB VRAM |
| 1024 | true | 15.81s | 0.25s | 75 MB VRAM | 10.20s | 5.10s | 1996.44 MB VRAM | 3.55s | 1549 MB VRAM | 2818 MB VRAM |
| 1024 | false | 19.99s | 0.26s | 75 MB VRAM | 10.20s | 5.10s | 1996.44 MB VRAM | 6.50s | 3394 MB VRAM | 6658 MB VRAM |

Stable Diffusion models: main vs PR2044

Modality: text-to-image (txt2img) using the existing SD examples. SD2.1 uses stable-diffusion-v2-1-Q8_0.gguf, 768x768, 5 steps, cfg_scale: 7.5, prediction: v. SD3 Medium uses sd3_medium_incl_clips.safetensors, 512x512, 28 steps, cfg_scale: 5.0, sampling_method: euler, prediction: flow, flow_shift: 3.0. SDXL uses stable-diffusion-xl-base-1.0-Q4_0.gguf, 512x512, 5 steps, cfg_scale: 6.5.

| model | branch | total avg | conditioner avg | conditioner buffer | denoise avg | denoise / step | denoise buffer | VAE decode avg | VAE decode buffer |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SD2.1 768x768, 5 steps | main | 47.80s | 0.20s | 1.89 MB RAM | 30.64s | 6.13s | 247.94 MB VRAM | 16.96s | 1584.14 MB VRAM |
| SD2.1 768x768, 5 steps | PR2044 | 6.00s | 0.09s | 1.89 MB VRAM | 3.95s | 0.79s | 247.94 MB VRAM | 1.96s | 1584.14 MB VRAM |
| SD3 Medium 512x512, 28 steps | main | 19.39s | 1.71s | 1.42/2.35 MB RAM | 10.38s | 0.37s | 66.4 MB VRAM | 7.29s | 704.25 MB VRAM |
| SD3 Medium 512x512, 28 steps | PR2044 | 11.25s | 0.16s | 1.42/2.36 MB VRAM | 10.24s | 0.37s | 66.4 MB VRAM | 0.85s | 704.25 MB VRAM |
| SDXL Base Q4_0 512x512, 5 steps | main | 22.84s | 1.61s | 1.42/2.35 MB RAM | 13.92s | 2.78s | 70.67 MB VRAM | 7.32s | 960.06 MB VRAM |
| SDXL Base Q4_0 512x512, 5 steps | PR2044 | 3.61s | 0.13s | 1.42/2.36 MB VRAM | 2.61s | 0.52s | 70.67 MB VRAM | 0.88s | 960.06 MB VRAM |

Output Quality Sanity Check

Settings: one txt2img generation per branch using the same prompt, seed, model files, and quality-oriented params. Outputs are saved under quality-images/; side-by-side contact sheets are in quality-images/contact/.

| model | quality params | result |
| --- | --- | --- |
| Flux2 Klein Q8_0 | 512x512, 12 steps, guidance: 3.5, cfg_scale: 1.0, prediction: flux2_flow, seed 42 | Retuned from the original 4-step check. PR2044 and main both produce a coherent full-body fox with normal anatomy; no quality regression or artifact was visible. |
| SD2.1 Q8_0 | 768x768, 20 steps, cfg_scale: 7.5, prediction: v, seed 42 | PR2044 is visually equivalent to main; composition, lighting, and fox anatomy match closely. |
| SD3 Medium | 512x512, 28 steps, cfg_scale: 5.0, sampling_method: euler, prediction: flow, flow_shift: 3.0, seed 42 | PR2044 is visually equivalent to main; no denoise or VAE artifact was visible. |
| SDXL Base Q4_0 | 512x512, 30 steps, cfg_scale: 6.5, seed 15 | PR2044 is visually equivalent to main. The image is coherent but stylized at 512x512; this checks regression/artifacts, not full native SDXL 1024 quality. |
[Image: all-models]

Breaking changes

None.

@gianni-cor gianni-cor requested review from a team as code owners May 13, 2026 23:55
@gianni-cor gianni-cor added the verified Authorize secrets / label-gate in PR workflows label May 14, 2026
gianni-cor and others added 12 commits May 14, 2026 23:56
Co-authored-by: Cursor <cursoragent@cursor.com>
…message)

Co-authored-by: Cursor <cursoragent@cursor.com>
Point the diffusion ggml overlay at the PR commit with the RoPE Flux and conv2d direct safety fixes.

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Resolve diffusion-cpp conflicts by keeping PR2044's Metal conditioner behavior while carrying forward upstream integration-test cleanup.

Co-authored-by: Cursor <cursoragent@cursor.com>
Default diffusion and VAE direct convolution paths to enabled while preserving explicit false overrides.

Co-authored-by: Cursor <cursoragent@cursor.com>
Point the diffusion-cpp ggml overlay at the latest qvac-ext-ggml PR commit so PR2044 picks up the newest Metal RoPE dispatch guard.

Co-authored-by: Cursor <cursoragent@cursor.com>
aegioscy
aegioscy previously approved these changes May 20, 2026
gianni-cor and others added 4 commits May 20, 2026 15:30
Co-authored-by: Cursor <cursoragent@cursor.com>
Consume the merged ggml and stable-diffusion-cpp registry pins for the Flux optimization stack and remove the temporary in-package overlay ports.

Co-authored-by: Cursor <cursoragent@cursor.com>
Mark the merged registry-port migration as the 0.9.0 diffusion-cpp release.

Co-authored-by: Cursor <cursoragent@cursor.com>

github-actions Bot commented May 21, 2026

Mobile integration tests — @qvac/diffusion-cpp (iOS)

Result: passed

| metric | value |
| --- | --- |
| Devices passed | 2 |
| Devices failed | 0 |
| Test cases total | 6 |
| Test cases passed | 6 |
| Test cases failed | 0 |
| Test cases skipped | 0 |



github-actions Bot commented May 21, 2026

Mobile integration tests — @qvac/diffusion-cpp (Android)

Result: passed

| metric | value |
| --- | --- |
| Devices passed | 3 |
| Devices failed | 0 |
| Test cases total | 9 |
| Test cases passed | 9 |
| Test cases failed | 0 |
| Test cases skipped | 0 |


gianni-cor and others added 2 commits May 21, 2026 21:41
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
@gianni-cor

/review


github-actions Bot commented May 22, 2026

Tier-based Approval Status

**PR Tier:** TIER1

**Current Status:** ✅ APPROVED

**Requirements:**
- 1 Team Member approval ✅ (2/1)
- 1 Team Lead OR Management approval ✅ (1/1)



---
*This comment is automatically updated when reviews change.*

@gianni-cor

/review


4 participants