diffusion-cpp: accelerate Flux2 and Stable Diffusion Metal inference by gianni-cor · Pull Request #2044 · tetherto/qvac

gianni-cor · 2026-05-13T23:55:47Z

What problem does this PR solve?

Delivers a large Flux2 performance improvement for diffusion-cpp on Apple Silicon: up to 7.27x faster total generation in txt2img and 6.05x faster total generation in i2i on M3 Ultra.
Cuts major phase costs: conditioner is ~33x faster, denoise per step is up to 2.21x faster, and VAE decode is ~8.5x faster in the measured Flux2 runs.
Reduces VAE compute-buffer memory when vae_conv_direct=true: VAE decode buffer drops by 58% versus the generic conv path.
Validates these gains through local vcpkg overlays for the optimized ggml and stable-diffusion.cpp branches before those dependencies are promoted through the registry.

How does it solve it?

Adds local vcpkg overlay ports for ggml and stable-diffusion-cpp under packages/diffusion-cpp.
Points ggml at the branch under review in tetherto/qvac-ext-ggml#9, which includes fused Flux RoPE/permute, Metal conv2d direct, and related Metal fixes.
Points stable-diffusion-cpp at the matching Flux2 integration branch so the addon actually uses the optimized Flux2 graph paths.
Honors the addon keepClipOnCpu config on Apple instead of forcing the conditioner to CPU; this change is the main source of the conditioner speedup.
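
The keepClipOnCpu change can be pictured with a minimal sketch. This is not the addon's code: `AddonConfig`, `keep_clip_on_cpu`, both function names, the `False` default, and the non-Apple branch of the "before" function are assumptions made for illustration. Only the Apple-specific behavior (forced CPU before, config honored after) comes from the description above.

```python
from dataclasses import dataclass


@dataclass
class AddonConfig:
    """Hypothetical stand-in for the addon config; field names are assumed."""
    device: str = "gpu"
    keep_clip_on_cpu: bool = False  # stands in for keepClipOnCpu; default is assumed


def conditioner_device_before(cfg: AddonConfig, is_apple: bool) -> str:
    """Behavior described for main: on Apple the conditioner always ran on CPU."""
    if is_apple:
        return "cpu"
    # Non-Apple handling is an assumption, included only to complete the sketch.
    return "cpu" if cfg.keep_clip_on_cpu else cfg.device


def conditioner_device_after(cfg: AddonConfig, is_apple: bool) -> str:
    """Behavior described for this PR: the config decides, on Apple as elsewhere."""
    # is_apple is kept in the signature for symmetry; it no longer changes the result.
    return "cpu" if cfg.keep_clip_on_cpu else cfg.device


cfg = AddonConfig(device="gpu", keep_clip_on_cpu=False)
print(conditioner_device_before(cfg, is_apple=True))  # cpu
print(conditioner_device_after(cfg, is_apple=True))   # gpu
```

Under these assumptions, a user who still wants the conditioner on CPU sets the flag explicitly, and everyone else gets the Metal path.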

Dependency PRs

ggml: tetherto/qvac-ext-ggml#9 - Metal fused Flux RoPE, permute, conv2d direct, and related fixes.
stable-diffusion.cpp: tetherto/qvac-ext-stable-diffusion.cpp#5 - Flux RoPE integration and optimized Flux2 graph path usage.

Benchmark setup

setting	value
Host	`qvac-dev-mac-arm64`
Hardware	Apple M3 Ultra
Diffusion model	`flux-2-klein-4b-Q8_0.gguf`
Text/LLM model	`Qwen3-4B-Q4_K_M.gguf`
VAE	`diffusion_pytorch_model.safetensors`
Steps	`2`
Seed	`42`
Guidance	`3.5`
Runs	`3` generations after one model load
Common flags	`device: gpu`, `fa: true`, `diffusion_fa: true`, `diffusion_conv_direct: true`, `threads: 4`
Memory metric	sd.cpp/ggml compute-buffer size logs: `qwen3` = conditioner, `flux` = denoise, first `vae` = encode, second `vae` = decode

Recap: speed and memory impact

result	512	1024
txt2img total speedup, `main` -> `PR2044`	30.31s -> 4.17s, 7.27x	66.65s -> 13.40s, 4.97x
txt2img conditioner speedup, `main` -> `PR2044`	16.84s -> 0.51s, 33.0x	16.83s -> 0.51s, 33.0x
txt2img denoise/step speedup, `main` -> `PR2044`	3.11s -> 1.41s, 2.21x	9.78s -> 4.68s, 2.09x
txt2img VAE decode speedup, `main` -> `PR2044`	7.25s -> 0.86s, 8.43x	30.26s -> 3.53s, 8.57x
i2i total speedup, `main` -> `PR2044`	24.09s -> 3.98s, 6.05x	71.54s -> 15.81s, 4.52x
i2i conditioner speedup, `main` -> `PR2044`	8.41s -> 0.25s, 33.6x	8.41s -> 0.25s, 33.6x
i2i denoise/step speedup, `main` -> `PR2044`	2.64s -> 1.22s, 2.16x	9.93s -> 5.10s, 1.95x
i2i VAE decode speedup, `main` -> `PR2044`	7.26s -> 0.86s, 8.44x	30.29s -> 3.55s, 8.53x
PR2044 VAE encode buffer, `vae_conv_direct=false` -> `true`	848.5 MB -> 387.5 MB, -54%	3394 MB -> 1549 MB, -54%
PR2044 VAE decode buffer, `vae_conv_direct=false` -> `true`	1664.5 MB -> 704.5 MB, -58%	6658 MB -> 2818 MB, -58%
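
For readers who want to recompute the ratios, the speedup and buffer-reduction figures in the recap follow from simple division. The sketch below uses values copied from the 512 column; the helper names are illustrative and not part of the benchmark tooling.

```python
def speedup(before_s: float, after_s: float) -> float:
    """Return how many times faster the new timing is than the old one."""
    return before_s / after_s


def reduction_pct(before_mb: float, after_mb: float) -> float:
    """Return the percentage by which a buffer shrank."""
    return (1.0 - after_mb / before_mb) * 100.0


# Values copied from the recap table, 512 column.
print(f"txt2img total: {speedup(30.31, 4.17):.2f}x")                # 7.27x
print(f"i2i total: {speedup(24.09, 3.98):.2f}x")                    # 6.05x
print(f"VAE encode buffer: -{reduction_pct(848.5, 387.5):.0f}%")    # -54%
print(f"VAE decode buffer: -{reduction_pct(1664.5, 704.5):.0f}%")   # -58%
```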

Why these improve:

Conditioner time drops because the addon no longer forces the Flux2 text/LLM conditioner to CPU on Apple; PR2044 keeps it on Metal.
Denoising gets faster from the optimized ggml/stable-diffusion.cpp Flux2 Metal path, including fused RoPE/permute and reduced copy overhead.
VAE gets faster and uses much less compute-buffer memory when vae_conv_direct=true, because the direct conv2d path avoids the large im2col-style intermediate buffers used by the generic path.
Toggling vae_conv_direct affects VAE buffers/time only; conditioner and denoise buffers stay unchanged in the benchmark.
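
A back-of-envelope sketch of why an im2col-style path needs large intermediates: the unfolded matrix stores every input patch explicitly, so for a 3x3 kernel at stride 1 it holds roughly nine times as many elements as the input feature map. The layer shape and the f16 element size below are hypothetical, chosen only for illustration; they are not taken from the Flux2 VAE, and the sketch does not model how ggml actually sizes its compute buffers.

```python
def im2col_elements(c_in: int, h_out: int, w_out: int, k_h: int, k_w: int) -> int:
    """Element count of the unfolded patch matrix: one k_h*k_w*c_in column per output pixel."""
    return c_in * k_h * k_w * h_out * w_out


def feature_map_elements(c: int, h: int, w: int) -> int:
    """Element count of a plain C x H x W feature map."""
    return c * h * w


# Hypothetical layer: 128 channels, 3x3 kernel, stride 1, "same" padding, 512x512 map.
c_in, h, w, k = 128, 512, 512, 3
bytes_per_element = 2  # assumes f16 storage

cols = im2col_elements(c_in, h, w, k, k)
src = feature_map_elements(c_in, h, w)

print(f"input map:     {src * bytes_per_element / 2**20:.0f} MiB")   # 64 MiB
print(f"im2col matrix: {cols * bytes_per_element / 2**20:.0f} MiB")  # 576 MiB
print(f"ratio:         {cols / src:.0f}x")                           # 9x
```

A direct convolution kernel reads the patches from the input in place, so this intermediate is never materialized.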

Text-to-image: main vs PR2044, `vae_conv_direct=true`

Modality: text-to-image (txt2img) with no init image.

size	branch	total avg	conditioner avg	conditioner buffer	denoise avg	denoise / step	denoise buffer	VAE decode avg	VAE decode buffer
512	`main`	30.31s	16.84s	74 MB RAM	6.22s	3.11s	368.94 MB VRAM	7.25s	704.5 MB VRAM
512	`PR2044`	4.17s	0.51s	75 MB VRAM	2.81s	1.41s	386.94 MB VRAM	0.86s	704.5 MB VRAM
1024	`main`	66.65s	16.83s	74 MB RAM	19.56s	9.78s	1105.44 MB VRAM	30.26s	2818 MB VRAM
1024	`PR2044`	13.40s	0.51s	75 MB VRAM	9.35s	4.68s	1078.44 MB VRAM	3.53s	2818 MB VRAM
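
In the txt2img rows above, the conditioner, denoise, and VAE decode averages add up to the reported total within rounding (0.01s). The sketch below checks that and reports which phase dominates each run. The dictionary layout and function names are illustrative.

```python
# (size, branch) -> (total, conditioner, denoise, vae_decode), seconds, from the table above.
TXT2IMG = {
    ("512", "main"): (30.31, 16.84, 6.22, 7.25),
    ("512", "PR2044"): (4.17, 0.51, 2.81, 0.86),
    ("1024", "main"): (66.65, 16.83, 19.56, 30.26),
    ("1024", "PR2044"): (13.40, 0.51, 9.35, 3.53),
}
PHASES = ("conditioner", "denoise", "vae_decode")


def phase_gap(key: tuple) -> float:
    """Absolute difference between the reported total and the sum of the three phases."""
    total, *phases = TXT2IMG[key]
    return abs(total - sum(phases))


def dominant_phase(key: tuple) -> str:
    """Name of the phase that takes the largest share of the run."""
    _, *phases = TXT2IMG[key]
    return max(zip(PHASES, phases), key=lambda item: item[1])[0]


for key in TXT2IMG:
    print(key, f"gap={phase_gap(key):.2f}s", f"dominant={dominant_phase(key)}")
```

On `main` the conditioner (512) or VAE decode (1024) takes the largest share; on `PR2044` denoising dominates at both sizes.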

Image-to-image: main vs PR2044, `vae_conv_direct=true`

Modality: FLUX2 image-to-image through in-context conditioning via a single init_image / ref_images, not SDEdit-style init-image noising.

size	branch	total avg	conditioner avg	conditioner buffer	denoise avg	denoise / step	denoise buffer	VAE decode avg	VAE encode buffer	VAE decode buffer
512	`main`	24.09s	8.41s	74 MB RAM	5.28s	2.64s	604.44 MB VRAM	7.26s	387.5 MB VRAM	704.5 MB VRAM
512	`PR2044`	3.98s	0.25s	75 MB VRAM	2.43s	1.22s	634.44 MB VRAM	0.86s	387.5 MB VRAM	704.5 MB VRAM
1024	`main`	71.54s	8.41s	74 MB RAM	19.86s	9.93s	2047.44 MB VRAM	30.29s	1549 MB VRAM	2818 MB VRAM
1024	`PR2044`	15.81s	0.25s	75 MB VRAM	10.20s	5.10s	1996.44 MB VRAM	3.55s	1549 MB VRAM	2818 MB VRAM

Image-to-image: PR2044, `vae_conv_direct` toggle

size	`vae_conv_direct`	total avg	conditioner avg	conditioner buffer	denoise avg	denoise / step	denoise buffer	VAE decode avg	VAE encode buffer	VAE decode buffer
512	`true`	3.98s	0.25s	75 MB VRAM	2.43s	1.22s	634.44 MB VRAM	0.86s	387.5 MB VRAM	704.5 MB VRAM
512	`false`	5.04s	0.25s	75 MB VRAM	2.43s	1.22s	634.44 MB VRAM	1.60s	848.5 MB VRAM	1664.5 MB VRAM
1024	`true`	15.81s	0.25s	75 MB VRAM	10.20s	5.10s	1996.44 MB VRAM	3.55s	1549 MB VRAM	2818 MB VRAM
1024	`false`	19.99s	0.26s	75 MB VRAM	10.20s	5.10s	1996.44 MB VRAM	6.50s	3394 MB VRAM	6658 MB VRAM

Stable Diffusion models: main vs PR2044

Modality: text-to-image (txt2img) using the existing SD examples.
SD2.1 uses stable-diffusion-v2-1-Q8_0.gguf, 768x768, 5 steps, cfg_scale: 7.5, prediction: v.
SD3 Medium uses sd3_medium_incl_clips.safetensors, 512x512, 28 steps, cfg_scale: 5.0, sampling_method: euler, prediction: flow, flow_shift: 3.0.
SDXL uses stable-diffusion-xl-base-1.0-Q4_0.gguf, 512x512, 5 steps, cfg_scale: 6.5.

model	branch	total avg	conditioner avg	conditioner buffer	denoise avg	denoise / step	denoise buffer	VAE decode avg	VAE decode buffer
SD2.1 `768x768`, 5 steps	`main`	47.80s	0.20s	1.89 MB RAM	30.64s	6.13s	247.94 MB VRAM	16.96s	1584.14 MB VRAM
SD2.1 `768x768`, 5 steps	`PR2044`	6.00s	0.09s	1.89 MB VRAM	3.95s	0.79s	247.94 MB VRAM	1.96s	1584.14 MB VRAM
SD3 Medium `512x512`, 28 steps	`main`	19.39s	1.71s	1.42/2.35 MB RAM	10.38s	0.37s	66.4 MB VRAM	7.29s	704.25 MB VRAM
SD3 Medium `512x512`, 28 steps	`PR2044`	11.25s	0.16s	1.42/2.36 MB VRAM	10.24s	0.37s	66.4 MB VRAM	0.85s	704.25 MB VRAM
SDXL Base Q4_0 `512x512`, 5 steps	`main`	22.84s	1.61s	1.42/2.35 MB RAM	13.92s	2.78s	70.67 MB VRAM	7.32s	960.06 MB VRAM
SDXL Base Q4_0 `512x512`, 5 steps	`PR2044`	3.61s	0.13s	1.42/2.36 MB VRAM	2.61s	0.52s	70.67 MB VRAM	0.88s	960.06 MB VRAM
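
The PR gives no recap table for the Stable Diffusion runs, so the per-phase speedups can be derived from the rows above. The sketch below does that; the timings are copied from the table, while the data layout and function name are illustrative.

```python
# (model, branch) -> (total, conditioner, denoise, vae_decode), seconds, from the table above.
SD_ROWS = {
    ("SD2.1", "main"): (47.80, 0.20, 30.64, 16.96),
    ("SD2.1", "PR2044"): (6.00, 0.09, 3.95, 1.96),
    ("SD3 Medium", "main"): (19.39, 1.71, 10.38, 7.29),
    ("SD3 Medium", "PR2044"): (11.25, 0.16, 10.24, 0.85),
    ("SDXL Base Q4_0", "main"): (22.84, 1.61, 13.92, 7.32),
    ("SDXL Base Q4_0", "PR2044"): (3.61, 0.13, 2.61, 0.88),
}
COLUMNS = ("total", "conditioner", "denoise", "vae_decode")


def phase_speedups(model: str) -> dict:
    """Return main / PR2044 timing ratios per column, rounded to two decimals."""
    before = SD_ROWS[(model, "main")]
    after = SD_ROWS[(model, "PR2044")]
    return {name: round(b / a, 2) for name, b, a in zip(COLUMNS, before, after)}


for model in ("SD2.1", "SD3 Medium", "SDXL Base Q4_0"):
    print(model, phase_speedups(model))
```

By these numbers SD3 Medium's denoise time is essentially unchanged (10.38s to 10.24s), so its total gain comes from the conditioner and VAE decode, whereas SD2.1 and SDXL also speed up in the denoise phase.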

Output Quality Sanity Check

Settings: one txt2img generation per branch using the same prompt, seed, model files, and quality-oriented params. Outputs are saved under quality-images/; side-by-side contact sheets are in quality-images/contact/.

model	quality params	result
Flux2 Klein Q8_0	`512x512`, 12 steps, `guidance: 3.5`, `cfg_scale: 1.0`, `prediction: flux2_flow`, seed `42`	Retuned from the original 4-step check. PR2044 and main both produce a coherent full-body fox with normal anatomy; no quality regression or artifact was visible.
SD2.1 Q8_0	`768x768`, 20 steps, `cfg_scale: 7.5`, `prediction: v`, seed `42`	PR2044 is visually equivalent to main; composition, lighting, and fox anatomy match closely.
SD3 Medium	`512x512`, 28 steps, `cfg_scale: 5.0`, `sampling_method: euler`, `prediction: flow`, `flow_shift: 3.0`, seed `42`	PR2044 is visually equivalent to main; no denoise or VAE artifact was visible.
SDXL Base Q4_0	`512x512`, 30 steps, `cfg_scale: 6.5`, seed `15`	PR2044 is visually equivalent to main. The image is coherent but stylized at `512x512`; this checks regression/artifacts, not full native SDXL 1024 quality.

Breaking changes

None.


github-actions · 2026-05-21T19:00:04Z

Mobile integration tests — @qvac/diffusion-cpp (iOS)

Result: passed

metric	value
Devices passed	2
Devices failed	0
Test cases total	6
Test cases passed	6
Test cases failed	0
Test cases skipped	0


github-actions · 2026-05-21T19:19:47Z

Mobile integration tests — @qvac/diffusion-cpp (Android)

Result: passed

metric	value
Devices passed	3
Devices failed	0
Test cases total	9
Test cases passed	9
Test cases failed	0
Test cases skipped	0


gianni-cor · 2026-05-22T10:01:54Z

/review

github-actions · 2026-05-22T10:02:21Z

Tier-based Approval Status

**PR Tier:** TIER1

**Current Status:** ✅ APPROVED

**Requirements:**
- 1 Team Member approval ✅ (2/1)
- 1 Team Lead OR Management approval ✅ (1/1)



---
*This comment is automatically updated when reviews change.*

gianni-cor · 2026-05-22T12:26:19Z

/review

gianni-cor requested review from a team as code owners May 13, 2026 23:55

gianni-cor requested a deployment to release May 13, 2026 23:56 — with GitHub Actions Waiting

gianni-cor requested a deployment to release May 13, 2026 23:57 — with GitHub Actions Waiting

gianni-cor had a problem deploying to release May 14, 2026 00:02 — with GitHub Actions Failure

gianni-cor temporarily deployed to release May 14, 2026 00:03 — with GitHub Actions Inactive

gianni-cor had a problem deploying to release May 14, 2026 00:03 — with GitHub Actions Failure

gianni-cor had a problem deploying to release May 14, 2026 00:03 — with GitHub Actions Error

gianni-cor had a problem deploying to release May 14, 2026 00:03 — with GitHub Actions Failure

gianni-cor added the `verified` label (Authorize secrets / label-gate in PR workflows) May 14, 2026

gianni-cor temporarily deployed to release May 14, 2026 11:21 — with GitHub Actions Inactive

gianni-cor and others added 12 commits May 14, 2026 23:56

js lint

ddc2d3b

fix: update ggml overlay REF to cleaned single commit

f085bb7

Co-authored-by: Cursor <cursoragent@cursor.com>

fix: update ggml overlay to cleaned commit (removed Android fix from message)

f0cb60c

Co-authored-by: Cursor <cursoragent@cursor.com>

fix: update ggml overlay to hardened commit

a835956

Point the diffusion ggml overlay at the PR commit with the RoPE Flux and conv2d direct safety fixes. Co-authored-by: Cursor <cursoragent@cursor.com>

Merge branch 'main' into feat/ggml-fused-rope-flux2

06c0ea2

Merge branch 'main' into feat/ggml-fused-rope-flux2

02e4dce

QVAC-18986 chore[diffusion]: add stable-diffusion-cpp overlay port

058a8b2

Co-authored-by: Cursor <cursoragent@cursor.com>

QVAC-18986 fix[diffusion]: honor CLIP CPU setting on Apple

e3b87ee

Co-authored-by: Cursor <cursoragent@cursor.com>

QVAC-18986 chore[diffusion]: merge main into PR2044

088cee8

Resolve diffusion-cpp conflicts by keeping PR2044's Metal conditioner behavior while carrying forward upstream integration-test cleanup. Co-authored-by: Cursor <cursoragent@cursor.com>

Merge remote-tracking branch 'upstream/main' into feat/ggml-fused-rope-flux2

7e144db

QVAC-18986 fix[diffusion]: enable direct conv by default

b831b91

Default diffusion and VAE direct convolution paths to enabled while preserving explicit false overrides. Co-authored-by: Cursor <cursoragent@cursor.com>

QVAC-18986 chore[diffusion]: bump ggml overlay ref

f982fe9

Point the diffusion-cpp ggml overlay at the latest qvac-ext-ggml PR commit so PR2044 picks up the newest Metal RoPE dispatch guard. Co-authored-by: Cursor <cursoragent@cursor.com>

aegioscy previously approved these changes May 20, 2026


gianni-cor and others added 4 commits May 20, 2026 15:30

QVAC-18986 chore[diffusion]: bump stable-diffusion overlay ref

28e2a31

Co-authored-by: Cursor <cursoragent@cursor.com>

QVAC-18986 chore[diffusion]: use merged registry ports

3e61e11

Consume the merged ggml and stable-diffusion-cpp registry pins for the Flux optimization stack and remove the temporary in-package overlay ports. Co-authored-by: Cursor <cursoragent@cursor.com>

Merge branch 'main' into feat/ggml-fused-rope-flux2

7779d3e

QVAC-18986 chore[diffusion]: bump package to 0.9.0

21c986e

Mark the merged registry-port migration as the 0.9.0 diffusion-cpp release. Co-authored-by: Cursor <cursoragent@cursor.com>

gianni-cor and others added 2 commits May 21, 2026 21:41

QVAC-18986 test[diffusion]: enable native logs in integration tests

6c4258b

Co-authored-by: Cursor <cursoragent@cursor.com>

QVAC-18986 chore[diffusion]: keep registry baseline unchanged

bae77fe

Co-authored-by: Cursor <cursoragent@cursor.com>

jpgaribotti approved these changes May 21, 2026


Merge branch 'main' into feat/ggml-fused-rope-flux2

b369eaa

aegioscy approved these changes May 22, 2026


gabrielgrigoras-serv approved these changes May 22, 2026


tobi-legan mentioned this pull request May 22, 2026

QVAC-19279 fix: extract logcat_full.txt into console-logs artifact for Android #2196

Merged

2 tasks

Merge branch 'main' into feat/ggml-fused-rope-flux2

894ce02

gianni-cor merged 23 commits into tetherto:main from gianni-cor:feat/ggml-fused-rope-flux2


4 participants
