perf: full PGO pipeline - SPGO, CallFrequency layout, hot-cold splitting, cross-module inlining #10877
Pull request overview
This PR adjusts the PGO collection workflow to retain low-count methods in the runtime .jit profile data so guarded devirtualization (GDV) type histograms are preserved for more methods, improving JIT inlining opportunities during runtime PGO.
Changes:
- Removes the effective trimming thresholds for .jit edge/block profile data by setting --min-block/--min-edge to 0.
- Updates workflow messaging/comments to reflect that the .jit data is being compressed (and retained) rather than aggressively trimmed.
@claude review this
MarekM25
approved these changes
Mar 19, 2026
kamilchodola
approved these changes
Mar 19, 2026
LukaszRozmej
approved these changes
Mar 19, 2026
RocksDB disposal hangs indefinitely on overlay filesystems (used by EXPB for PGO collection), preventing WritePGOData from flushing the .jit file. The process gets SIGKILL before reaching the PGO write. Add 15s timeout on lifetimeScope.DisposeAsync() so shutdown proceeds even if DB close hangs. This allows the runtime's ProcessExit handler to flush PGO data. TEMPORARY: revert once the snapshot disposal hang is investigated.
…PORARY)" This reverts commit 1d37afb.
…gression Pettis-Hansen was a no-op (no CallWeights) during the benchmark run, so the regression must be from TC delay, cross-module inlining, or warmup changes. Remove TC_CallCountingDelayMs=30 (reverts to default 100ms) to test if this was the cause.
Pettis-Hansen uses an undirected call graph, losing caller-callee directionality. CallFrequency preserves direction (places callees after callers), which the Facebook hfsort paper showed gives 2x better IPC improvement than PH.

New PgoTrim subcommand: generate-callchain
- Reads .callgraph (IP pairs) + .etlx (method map)
- Resolves IPs to method names via binary search on MethodMemoryMap
- Outputs CallChainProfile JSON for crossgen2 --callchain-profile
- Also writes .sizes file (method name, native size, exclusive samples) for potential CDS (Cache-Directed Sort) implementation

Directory.Build.targets:
- Uses --method-layout:callfrequency when callchain JSON exists
- Falls back to --method-layout:pettishansen when only .mibc available

Workflow:
- Generates callchain JSON in PgoTrim step
- Uploads/downloads/commits as additional PGO artifact
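The IP resolution and directed edge counting described for generate-callchain can be sketched in Python as follows. This is a minimal illustration and not the PgoTrim code: the method map is modeled as (start address, size, name) tuples, and the function names resolve_ip and build_directed_edges are hypothetical.

```python
from bisect import bisect_right
from collections import Counter


def resolve_ip(starts, methods, ip):
    """Return the name of the method whose [start, start + size) range holds ip."""
    i = bisect_right(starts, ip) - 1  # last method that starts at or before ip
    if i < 0:
        return None
    start, size, name = methods[i]
    return name if ip < start + size else None


def build_directed_edges(methods, ip_pairs):
    """Count (caller, callee) name pairs. Direction is kept: (A, B) != (B, A)."""
    methods = sorted(methods)  # binary search needs ascending start addresses
    starts = [m[0] for m in methods]
    edges = Counter()
    for caller_ip, callee_ip in ip_pairs:
        caller = resolve_ip(starts, methods, caller_ip)
        callee = resolve_ip(starts, methods, callee_ip)
        # Pairs that fall outside every known method are dropped.
        if caller is not None and callee is not None:
            edges[(caller, callee)] += 1
    return dict(edges)
```

Keeping the key as an ordered (caller, callee) tuple is what distinguishes this from an undirected graph, where both directions would collapse into one weight.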
crossgen2's --hot-cold-splitting flag (CORJIT_FLAG_PROCSPLIT) tells the JIT to split R2R method bodies into hot and cold sections during AOT compilation. Cold basic blocks (error paths, exception handlers, rare branches) are moved to a separate .text.cold section. This uses the SPGO block counts from the .mibc (1,361 methods with per-block CPU sample attribution) to identify which blocks are cold. The result is a smaller hot code working set with better I-cache density - the .NET equivalent of BOLT's basic block reordering.
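As a rough illustration of the idea (not the JIT's actual heuristic), per-block sample counts can be used to partition a method's blocks into a hot set and a cold set. The block tuples, the min_hot_samples threshold, and the function names below are all hypothetical.

```python
def split_hot_cold(blocks, min_hot_samples=1):
    """Partition blocks into (hot, cold), preserving their original order.

    blocks: list of (label, size_bytes, samples) tuples.
    A block is treated as cold when it has fewer than min_hot_samples samples.
    """
    hot = [b for b in blocks if b[2] >= min_hot_samples]
    cold = [b for b in blocks if b[2] < min_hot_samples]
    return hot, cold


def total_bytes(blocks):
    """Sum the native size of a list of blocks."""
    return sum(size for _, size, _ in blocks)
```

With blocks such as an entry block and a loop body that receive samples, plus a throw helper that receives none, the hot section holds only the sampled blocks, so its byte size is smaller than the whole method. That size reduction is the I-cache density effect the commit describes.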
…t build time Store as .gz in repo to reduce commit size. MSBuild target decompresses before Publish so crossgen2 can read the JSON. Same pattern as .jit.gz.
…uild target MSBuild PropertyGroup Exists() conditions are evaluated at load time, before any targets run. The DecompressCallChainProfile target ran BeforeTargets="Publish" which is too late - the Crossgen2ExtraCommandLineArgs was already set to pettishansen by the time the .json was decompressed. Fix: decompress in the Dockerfile RUN step before dotnet publish, so the .json exists when MSBuild evaluates the PropertyGroup conditions. Remove the MSBuild target since it's no longer needed.
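The ordering constraint above means the decompressed file has to be on disk before the build starts. A minimal Python sketch of such a pre-build step is shown here as a stand-in for whatever command the Dockerfile RUN step actually uses; the helper name decompress_gz is hypothetical.

```python
import gzip
import shutil
from pathlib import Path


def decompress_gz(src):
    """Decompress src (a .gz file) next to itself and return the output path."""
    src = Path(src)
    if src.suffix != ".gz":
        raise ValueError(f"expected a .gz file, got {src.name}")
    dst = src.with_suffix("")  # e.g. demo.callchain.json.gz -> demo.callchain.json
    with gzip.open(src, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return dst
```

Running a step like this before dotnet publish means an Exists() condition sees the .json file when the project is loaded, rather than after a target has already run.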
…ry.Build.targets PublishReadyToRunComposite and OptimizationPreference=Speed were only in Runner.csproj. Moving them to Directory.Build.targets ensures any project published with R2R (including BDN benchmarks using the R2R toolchain) gets the same settings as production.
Block Processing Benchmark Comparison
EXPB Benchmark Comparison (scenarios: superblocks, realblocks)
Summary
Full Profile-Guided Optimization (PGO) pipeline for Nethermind, collecting runtime profiling data and using it to optimize both R2R (ReadyToRun) ahead-of-time compilation and runtime Tier-1 JIT recompilation.
R2R Compile-Time Optimizations
- Cross-module inlining (--opt-cross-module:*): Passed to crossgen2 so framework methods (Dictionary.TryGetValue, Span<T>.Slice, Memory<T>.Span, etc.) can be inlined into Nethermind R2R code at build time. Without this, those call sites stay as regular method calls until Tier-1 recompiles at runtime. Safe for Docker images where the framework version is pinned by the base image hash.
- CallFrequency method layout (--method-layout:callfrequency): Uses directed caller-callee edge weights from CPU sampling (917K resolved edges, 2,480 callers) to place callees after their callers in the R2R image. This preserves call direction for better instruction prefetch, unlike Pettis-Hansen, which uses an undirected graph. Falls back to Pettis-Hansen when callchain data is unavailable.
- Hot-cold splitting (--hot-cold-splitting): Uses SPGO block counts from the .mibc (1,361 methods with per-block CPU sample attribution) to split R2R method bodies into hot and cold sections. Cold basic blocks (error paths, exception handlers, rare branches) are moved to a .text.cold section, keeping the hot code working set smaller and improving I-cache density. The .NET equivalent of BOLT's basic block reordering.
- DOTNET_JitInlinePolicyProfile=1: The Tier-1 JIT inlines more aggressively at hot call sites and less at cold ones, based on the seeded PGO frequency data.

EVM Opcode Warmup (VirtualMachine.Warmup.cs)
- Replaces PushOne (value=1) with multi-word UInt256 values that exercise common arithmetic paths. Value 1 caused Tier-0 PGO to profile degenerate branches: DIV/MOD by 1 takes the trivial fast path, EXP with base 1 is the identity, and SHL/SHR by 1 is a minimal shift. The seeded edge counts now reflect the branches that mainnet contracts actually take (multi-word division, full remainder, etc.).
- Skips opcodes that go through IWorldState, because warmup uses a different implementation than real block processing. The JIT's Tier-0 GDV profiling records the warmup type, creating bimodal type histograms that prevent devirtualization. By skipping these opcodes, GDV profiles only capture the production IWorldState type from real execution, enabling direct devirtualization instead of slower type-check guards.

PGO Data Collection (collect-pgo-profile.yml)
- DOTNET_EnableEventPipe over 10,000 mainnet blocks
- .jit: Runtime PGO data from DOTNET_WritePGOData - edge counts and guarded devirtualization (GDV) type histograms that drive branch prediction and virtual call elimination at Tier-1. Compressed by PgoTrim.
- libcoreclr.so built with LTTng tracepoint support (Microsoft SDK ships dummy provider since dotnet/runtime#113876).
- TC_CallCountingDelayMs=900000 prevents Tier-1 recompilation during sampling so the perf map stays valid (without this, 97% of samples fall outside managed code).

SPGO and Call Graph Extraction
- convert-trace: Injects missing CTF mappings (MethodDetails, MethodILToNativeMap_V1) and converts .trace.zip to .etlx with KeepAllEvents=true
- extract-spgo: Extracts ~9.3M perf CPU sample leaf IPs to a .spgo file and ~9.2M caller-callee IP pairs to a .callgraph file from perfcollect's perf.data.txt callstacks
- generate-callchain: Resolves .callgraph IPs to method names using the .etlx MethodMemoryMap, outputs CallChainProfile JSON (917K directed edges, 2,480 callers) for crossgen2's --callchain-profile / --method-layout:callfrequency
- NethermindPgoPatches.cs compiled into dotnet-pgo at build time:
  - LoadSpgoSamples: reads .spgo for SPGO basic block attribution (~969K samples attributed, ~10% rate)
  - LoadCallGraph: reads .callgraph, resolves IP pairs via MethodMemoryMap, populates call graph and exclusive sample counts for .mibc CallWeights
  - SafeSmoothAllProfiles: per-method try-catch for FlowSmoothing crash on disconnected flow graphs

Profile Data - .mibc (R2R compile-time)
Used by crossgen2 for ahead-of-time R2R compilation decisions:

Profile Data - .callchain.json.gz (R2R method layout)
Stored compressed in repo (222KB). Decompressed at build time by MSBuild target. Contains directed caller-callee edge weights for crossgen2's CallFrequency method layout:

Profile Data - .jit.gz (Runtime Tier-1 JIT)
Stored compressed in repo and Docker image. Decompressed to nethermind.jit at image build time. The runtime reads it via DOTNET_ReadPGOData to seed the JIT's PGO data store, giving Tier-1 recompilation edge counts and GDV data from the first recompile without needing a warm-up period:

Upstream Issues Found & PRs
Other
- security_opt support + 120s stop timeout (NethermindEth/execution-payloads-benchmarks#9)

Type of change
Test plan
- .mibc profile (790KB) includes edge counts, SPGO blocks, GDV, and call graph weights
- .callchain.json.gz (222KB) with directed caller-callee edges for crossgen2
- .jit.gz runtime PGO data (7,239 methods, 5,152 devirtualizable GDV sites) compressed by PgoTrim
- NethermindPgoPatches.cs compiles cleanly into dotnet-pgo (verified locally, 0 warnings)
- --method-layout:callfrequency, --callchain-profile, --hot-cold-splitting, --opt-cross-module:*. pettishansen NOT present - correctly using CallFrequency.
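For reference, the extract-spgo step named in the summary (leaf IPs plus caller-callee IP pairs from perf callstacks) can be sketched in Python. This is an assumption-laden illustration and not the PgoTrim implementation. It assumes that each frame line is indented and starts with a hex address, that frames are listed leaf first, that sample header lines do not have that shape, and that only the leaf and its immediate caller form a pair. The function name extract_samples is hypothetical.

```python
import re

# An indented line that begins with a hex address is treated as a stack frame.
_FRAME = re.compile(r"^\s+([0-9a-fA-F]+)\s")


def extract_samples(lines):
    """Return (leaf_ips, caller_callee_pairs) from perf-script-like text lines."""
    leaf_ips, pairs = [], []
    frames = []

    def flush():
        # Close the current sample: record its leaf and the (caller, callee) pair.
        if frames:
            leaf_ips.append(frames[0])
            if len(frames) > 1:
                pairs.append((frames[1], frames[0]))
        frames.clear()

    for line in lines:
        match = _FRAME.match(line)
        if match:
            frames.append(int(match.group(1), 16))
        else:
            flush()  # a header or blank line ends the previous sample
    flush()
    return leaf_ips, pairs
```

Samples with a single frame contribute a leaf IP but no pair, which would be one way to end up with slightly fewer pairs than leaf samples.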