Fix macOS GL performance: read stencil bits from pixel format#3546
Fix macOS GL performance: read stencil bits from pixel format#3546mattleibow wants to merge 1 commit intomainfrom
Conversation
Detailed Investigation Process & FindingsBelow is the complete investigation that led to this fix — every phase, experiment, dead end, and breakthrough. This was originally posted on #3525. Investigation Report: macOS GL Rendering Performance GapI conducted a systematic investigation of this performance gap with rigorous A/B benchmarking across both Metal and OpenGL backends in both C++ and C#. Here are the complete findings. TL;DRRoot cause found. 1. Investigation MethodologyI built 12 separate benchmark applications — a mix of native C++ and C# .NET — each targeting a specific backend and configuration. All benchmarks render the same MotionMark-style scene (stroked cubic Bézier paths) at increasing complexity levels (C0=1k, C4=5k, C8=9k, C12=40k elements) and report per-phase timing (render, flush, glFinish, swap). Test environment:
2. Phase 1 — Baseline: Metal Has Zero OverheadFirst, I established that SKMetalView has zero overhead compared to native C++ Metal:
Conclusion: Metal is not the problem. The reported gap is GL-specific. 3. Phase 2 — OpenGL Per-Phase ProfilingI created bare-metal GL benchmark apps (NSView + NSOpenGLContext, 3.2 Core profile, no SkiaSharp views) with per-phase instrumentation. At C12 (40k paths):
Key finding: 4. Phase 3 — Path Batching Proves Per-Draw-Call OverheadTo prove the flush cost scales with draw count, I tested path batching:
*Total worse because batching increases tessellation cost. Proof: Reducing draw calls from 40k to 1 reduces flush from 60ms to 0.24ms — a 250× reduction. The flush cost is definitively per-draw-call GL state validation overhead. 5. Phase 4 — MSAA Changes Everything (on Native)The native C++ app uses 4× MSAA by default. I tested the native app with MSAA ON and OFF:
Critical insight: The native "114 fps" originally reported was VSync-masked. Real performance with The difference is that 4× MSAA activates Skia's 6. Phase 5 — Why Doesn't MSAA Help C#?I created a C# benchmark matching the native setup exactly (4× MSAA pixel format, same GL attributes). Result:
4× MSAA made things worse, not better! The render time stayed at ~62ms (TessellationPathRenderer NOT active), and the MSAA resolve doubled the flush cost. But when I used a Skia-managed MSAA surface (
*Total still high because of offscreen MSAA resolve cost (170ms in glFinish), but render dropped from 55ms to 8.9ms — proving TessellationPathRenderer IS active on managed surfaces. Question: Why does TessellationPathRenderer activate on Skia-managed surfaces but not on wrapped framebuffers? 7. Phase 6 — Root Cause: Stencil BitsI created a diagnostic that checks macOS reports Skia's // TessellationPathRenderer.cpp
bool TessellationPathRenderer::IsSupported(const GrCaps& caps) {
return !caps.avoidStencilBuffers() && // ← needs stencil!
caps.drawInstancedSupport() &&
!caps.disableTessellationPathRenderer();
}The native Skia app reads stencil from the pixel format object, not from GL: // GLWindowContext_mac.mm:127
[fPixelFormat getValues:&stencilBits forAttribute:NSOpenGLPFAStencilSize forVirtualScreen:0];But Gles.glGetIntegerv(Gles.GL_STENCIL_BITS, out var stencil); // Returns 0 on macOS!This is the root cause. When
8. The FixRead stencil bits from // Before (buggy):
Gles.glGetIntegerv(Gles.GL_STENCIL_BITS, out var stencil);
// After (fixed):
var stencil = 0;
if (PixelFormat is not null)
PixelFormat.GetValue(ref stencil, NSOpenGLPixelFormatAttribute.StencilSize, 0);
if (stencil == 0)
Gles.glGetIntegerv(Gles.GL_STENCIL_BITS, out stencil);9. Results After Fix
Per-phase at C12:
C# GL is now 26–52% faster than native C++ GL at high path counts. The remaining advantage comes from .NET's efficient memory management and SkiaSharp's optimized P/Invoke layer. 10. Summary Table — All Configurations
The originally reported "120 fps native vs <10 fps C#" comparison was:
With the fix, C# matches or exceeds native performance on every backend and complexity level. 11. AI Model CorroborationI consulted three AI models (GPT 5.3, Gemini, Opus 4.6) to validate findings:
All three agreed the GL Fix is on branch |
Tools, Techniques & Environment UsedThis section documents exactly what tools and techniques were used during the investigation, so future investigations can have them pre-installed and ready. Build & Runtime Tools
Key .NET Commands# Build a benchmark app
dotnet build -c Release
# Run directly (NOT via `open -n` — stdout capture doesn't work with app bundles)
./bin/Release/net10.0-macos/osx-arm64/AppName.app/Contents/MacOS/AppName
# Check SDK version (run OUTSIDE repo to avoid global.json pinning)
cd /tmp && dotnet --infoProfiling Technique: Manual Per-Phase InstrumentationNo external profiler was used. Instead, I instrumented each rendering phase with var sw = Stopwatch.StartNew();
// Phase 1: render
canvas.Clear(SKColors.White);
foreach (var elem in elements)
canvas.DrawPath(elem.Path, elem.Paint);
var renderMs = sw.Elapsed.TotalMilliseconds;
// Phase 2: canvas flush (Skia → GPU command submission)
sw.Restart();
canvas.Flush();
var flushMs = sw.Elapsed.TotalMilliseconds;
// Phase 3: GPU drain (wait for GPU to finish)
sw.Restart();
GL.glFinish();
var finishMs = sw.Elapsed.TotalMilliseconds;
// Phase 4: buffer swap
sw.Restart();
openGLContext.FlushBuffer();
var swapMs = sw.Elapsed.TotalMilliseconds;This per-phase approach was critical — it showed that Why not OpenGL DiagnosticsCustom diagnostic code was essential for proving the root cause. Key GL queries: // The diagnostic that proved the root cause
GL.glGetIntegerv(GL.GL_STENCIL_BITS, out int stencilBits);
// Returns 0 on macOS default framebuffer — THIS IS THE BUG
GL.glGetIntegerv(GL.GL_SAMPLES, out int samples);
// Returns correct value (4 with MSAA)
GL.glGetIntegerv(GL.GL_FRAMEBUFFER_BINDING, out int fb);
// Confirms we're on framebuffer 0 (default)
// The correct way (matches native Skia):
pixelFormat.GetValue(ref stencil, NSOpenGLPixelFormatAttribute.StencilSize, 0);
// Returns 8 — the actual allocated stencil bitsKey insight: Always cross-check GL state queries against the pixel format on macOS. The GL driver can report different values from what was actually allocated. Native C++ Reference AppBuilt the reporter's FastSkiaSharp repo as a native baseline: cd /tmp/skiasharp-perf/repo
git clone https://github.com/mattleibow/FastSkiaSharp.git --branch main_119 .
# Bootstrap Skia (uses its own build system)
python3 tools/git-sync-deps
cd extern/skia
bin/gn gen out/Release --args='is_official_build=true skia_use_gl=true ...'
ninja -C out/Release
# Build the app
cd /tmp/skiasharp-perf/repo
cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=Release
ninja -C build
./build/FastSkiaSharpThis was necessary to get accurate native timings (the originally reported 120fps was VSync-masked). VSync Control (macOS NSOpenGLContext)VSync masks real performance. Disabling it was critical for accurate measurement: // P/Invoke to set swap interval = 0 (disable VSync)
const int NSOpenGLCPSwapInterval = 222;
var val = 0;
objc_msgSend_setValues(context.Handle, setValuesSelector, ref val, NSOpenGLCPSwapInterval);Without this, all GL benchmarks would report ~60fps or ~120fps regardless of actual rendering cost. AI Model ConsultationUsed three models as "second opinions" on findings:
Each model was given the same data (profiling numbers, code snippets) and asked for independent analysis. Consensus on 2+ models was treated as corroboration. Skia Source ReadingDirect source reading was essential for understanding Skia's internal path renderer selection:
macOS App Requirements (Lessons Learned)Several macOS-specific gotchas that cost debugging time:
Recommended Pre-Install for Future Investigations# Essential
brew install cmake ninja gh
xcode-select --install # For clang++ and Metal SDK
# .NET workloads
dotnet workload install macos maui-maccatalyst
# Useful for other investigation types
brew install glfw # For standalone GL test apps
# dotnet tool install -g dotnet-trace # For managed profiling if needed |
02dbf1f to
23c78a3
Compare
On macOS, glGetIntegerv(GL_STENCIL_BITS) returns 0 for the default framebuffer even when the pixel format has allocated 8 stencil bits. The native Skia sample app reads stencil from NSOpenGLPixelFormat, but SKGLView was reading from GL, causing Skia to disable its fast TessellationPathRenderer and fall back to the much slower DefaultPathRenderer. This fix reads stencil bits from the pixel format first, with a fallback to GL for non-macOS platforms. The result is a 15.6x performance improvement at high path counts (40k stroked paths: 5.3 fps → 77-93 fps), which is faster than native C++ GL (61 fps). Root cause: Without stencil, TessellationPathRenderer::onCanDrawPath() rejects paths, forcing DefaultPathRenderer to CPU-tessellate every path and issue one GL draw call per segment (~40k calls). With correct stencil, TessellationPathRenderer uses GPU tessellation shaders and batches efficiently. Fixes #3525 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
23c78a3 to
06a2b57
Compare
Fix macOS GL performance: read stencil bits from pixel format
Fixes #3525
Problem
SKGLViewon macOS suffers a severe performance regression becauseglGetIntegerv(GL_STENCIL_BITS)returns 0 for the default framebuffer, even when the pixel format allocates 8 stencil bits. This is a macOS GL driver quirk — the native Skia sample app works around it by reading fromNSOpenGLPixelFormatinstead (GLWindowContext_mac.mm:127).Without the correct stencil value, Skia's
GRBackendRenderTargetis created withstencil=0, which disablesTessellationPathRenderer(requires stencil for stencil-and-cover rendering). Skia falls back toDefaultPathRenderer, which CPU-tessellates every path and issues ~40k individual GL draw calls — each with per-call GL state validation overhead.Fix
Read stencil bits from
NSOpenGLPixelFormat.GetValue()(matching the native approach), with a fallback toglGetIntegervfor non-macOS platforms or if the pixel format is unavailable.Results (MotionMark, 40k stroked paths, C12 complexity)
C# GL now outperforms native C++ GL by 26–52%. Metal was already verified identical to native (zero overhead at all complexity levels).
What Changed
source/SkiaSharp.Views/SkiaSharp.Views/Platform/macOS/SKGLView.cs— Read stencil from pixel format instead of GLTesting
SkiaSharp.Views.csprojcompiles cleanly (0 errors)See the investigation comment below for the full detailed process.