perf: full PGO pipeline - SPGO, CallFrequency layout, hot-cold splitting, cross-module inlining #10877

Draft
benaadams wants to merge 276 commits into master from pgo-2

@benaadams benaadams commented Mar 19, 2026

Summary

Full Profile-Guided Optimization (PGO) pipeline for Nethermind, collecting runtime profiling data and using it to optimize both R2R (ReadyToRun) ahead-of-time compilation and runtime Tier-1 JIT recompilation.

R2R Compile-Time Optimizations

  • Cross-module inlining (--opt-cross-module:*): Passed to crossgen2 so framework methods (Dictionary.TryGetValue, Span<T>.Slice, Memory<T>.Span, etc.) can be inlined into Nethermind R2R code at build time. Without this, those call sites stay as regular method calls until Tier-1 recompiles at runtime. Safe for Docker images where the framework version is pinned by the base image hash.
  • CallFrequency method layout (--method-layout:callfrequency): Uses directed caller-callee edge weights from CPU sampling (917K resolved edges, 2,480 callers) to place callees after their callers in the R2R image. This preserves call direction for better instruction prefetch, unlike Pettis-Hansen which uses an undirected graph. Falls back to Pettis-Hansen when callchain data is unavailable.
  • Hot-cold splitting (--hot-cold-splitting): Uses SPGO block counts from the .mibc (1,361 methods with per-block CPU sample attribution) to split R2R method bodies into hot and cold sections. Cold basic blocks (error paths, exception handlers, rare branches) are moved to a .text.cold section, keeping the hot code working set smaller and improving I-cache density. The .NET equivalent of BOLT's basic block reordering.
  • Profile-driven inlining (DOTNET_JitInlinePolicyProfile=1): The Tier-1 JIT inlines more aggressively at hot call sites and less at cold ones, based on the seeded PGO frequency data.

EVM Opcode Warmup (VirtualMachine.Warmup.cs)

  • Representative values: Replaced PushOne (value=1) with multi-word UInt256 values that exercise common arithmetic paths. Value 1 caused Tier-0 PGO to profile degenerate branches - DIV/MOD by 1 takes the trivial fast-path, EXP with base 1 is identity, SHL/SHR by 1 is minimal shift. The seeded edge counts now reflect the branches that mainnet contracts actually take (multi-word division, full remainder, etc.).
  • Skip state-touching opcodes: SLOAD, SSTORE, CALL, STATICCALL, DELEGATECALL, CREATE, LOG, BALANCE, etc. are skipped during warmup. These opcodes dispatch through IWorldState - but warmup uses a different implementation than real block processing. The JIT's Tier-0 GDV profiling records the warmup type, creating bimodal type histograms that prevent devirtualization. By skipping these opcodes, GDV profiles only capture the production IWorldState type from real execution, enabling direct devirtualization instead of slower type-check guards.
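The effect of a mixed warmup/production type profile on devirtualization can be illustrated with a small sketch. This is not the JIT's actual GDV logic: the type names `WarmupWorldState` and `ProductionWorldState` and the `dominance_threshold` value are hypothetical, chosen only to show why a bimodal histogram blocks the monomorphic case.

```python
from collections import Counter


def classify_call_site(observed_types, dominance_threshold=0.9):
    """Classify a call site from the receiver types seen while profiling.

    dominance_threshold is a hypothetical cut-off for this sketch,
    not a value taken from the .NET JIT.
    """
    histogram = Counter(observed_types)
    total = sum(histogram.values())
    if total == 0:
        return "unprofiled", None
    top_type, top_count = histogram.most_common(1)[0]
    if top_count / total >= dominance_threshold:
        return "monomorphic", top_type
    return "polymorphic", top_type


# Warmup and real blocks both hit the same IWorldState call site (hypothetical counts).
mixed = ["WarmupWorldState"] * 500 + ["ProductionWorldState"] * 500
# Warmup skips state-touching opcodes, so only the production type is recorded.
production_only = ["ProductionWorldState"] * 1000

mixed_result = classify_call_site(mixed)
clean_result = classify_call_site(production_only)
print(mixed_result[0], clean_result[0])
```

With the warmup type removed from the histogram, the site has a single dominant receiver type, which is the situation the bullet above is aiming for.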

PGO Data Collection (collect-pgo-profile.yml)

  • EventPipe trace (main PGO container): Collects method load/JIT events, edge/block counts, and GDV type histograms via DOTNET_EnableEventPipe over 10,000 mainnet blocks
  • Edge/block profiling (.jit): Runtime PGO data from DOTNET_WritePGOData - edge counts and guarded devirtualization (GDV) type histograms that drive branch prediction and virtual call elimination at Tier-1. Compressed by PgoTrim.
  • CPU sampling (sampling container): perfcollect (perf + LTTng) captures ~9.3M kernel CPU samples over ~10 minutes alongside CLR events for SPGO block-level attribution and call graph extraction. This requires a custom libcoreclr.so built with LTTng tracepoint support, because the Microsoft SDK has shipped a dummy provider since dotnet/runtime#113876. TC_CallCountingDelayMs=900000 prevents Tier-1 recompilation during sampling so the perf map stays valid (without this, 97% of samples fall outside managed code).

SPGO and Call Graph Extraction

  • PgoTrim convert-trace: Injects missing CTF mappings (MethodDetails, MethodILToNativeMap_V1) and converts .trace.zip to .etlx with KeepAllEvents=true
  • PgoTrim extract-spgo: Extracts ~9.3M perf CPU sample leaf IPs to .spgo file and ~9.2M caller-callee IP pairs to .callgraph file from perfcollect's perf.data.txt callstacks
  • PgoTrim generate-callchain: Resolves .callgraph IPs to method names using the .etlx MethodMemoryMap, outputs CallChainProfile JSON (917K directed edges, 2,480 callers) for crossgen2's --callchain-profile / --method-layout:callfrequency
  • NethermindPgoPatches.cs compiled into dotnet-pgo at build time:
    • LoadSpgoSamples: reads .spgo for SPGO basic block attribution (~969K samples attributed, ~10% rate)
    • LoadCallGraph: reads .callgraph, resolves IP pairs via MethodMemoryMap, populates call graph and exclusive sample counts for .mibc CallWeights
    • SafeSmoothAllProfiles: per-method try-catch for FlowSmoothing crash on disconnected flow graphs
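The IP-resolution step described for generate-callchain and LoadCallGraph can be sketched roughly as follows. This is an illustration under assumptions, not PgoTrim's code: the method map layout, the addresses, and the `Helper.Callee` name are made up, and the real MethodMemoryMap lives in the .etlx file.

```python
from bisect import bisect_right
from collections import Counter

# Hypothetical method map: (start_address, size_in_bytes, method_name), sorted by start.
METHOD_MAP = [
    (0x1000, 0x200, "KeccakHash.ComputeHash"),
    (0x1400, 0x100, "RunByteCode"),
    (0x2000, 0x080, "Helper.Callee"),
]
STARTS = [start for start, _, _ in METHOD_MAP]


def resolve_ip(ip):
    """Return the method containing ip, or None if it falls outside every range."""
    index = bisect_right(STARTS, ip) - 1
    if index < 0:
        return None
    start, size, name = METHOD_MAP[index]
    return name if ip < start + size else None


def build_edges(ip_pairs):
    """Aggregate (caller_ip, callee_ip) pairs into directed edge weights."""
    edges = Counter()
    for caller_ip, callee_ip in ip_pairs:
        caller, callee = resolve_ip(caller_ip), resolve_ip(callee_ip)
        # Pairs that do not resolve on both sides are dropped.
        if caller is not None and callee is not None:
            edges[(caller, callee)] += 1
    return edges


pairs = [(0x1410, 0x1010), (0x1420, 0x1020), (0x1410, 0x2004), (0x9999, 0x1010)]
edges = build_edges(pairs)
```

Keeping the key as an ordered `(caller, callee)` tuple is what preserves direction; an undirected graph would merge `(A, B)` and `(B, A)` into one weight.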

Profile Data - .mibc (R2R compile-time)

Used by crossgen2 for ahead-of-time R2R compilation decisions:

| Data Type | Coverage | Detail |
|---|---|---|
| Edge counts | 6,615 methods | 38,308 entries, 336M total executions - branch prediction hints |
| SPGO block counts | 1,361 methods | ~8K block entries, ~969K attributed CPU samples - hot/cold splitting |
| GDV type histograms | 3,098 methods | 8,616 call sites: 5,063 devirtualizable (4,558 monomorphic, 489 polymorphic) |
| Call graph | 2,791 methods | 12,365 caller-callee edges, 4,415 methods with ExclusiveWeight |
| Method histograms | ~366 methods | ~479 delegate/interface dispatch entries |
| Total profiled methods | 7,867 with instrumentation | 32,478 in hot list |

Profile Data - .callchain.json.gz (R2R method layout)

Stored compressed in repo (222KB). Decompressed at build time by MSBuild target. Contains directed caller-callee edge weights for crossgen2's CallFrequency method layout:

| Metric | Value |
|---|---|
| Resolved edges | 917,618 |
| Unique callers | 2,480 |
| Methods with samples | 4,146 |
| Top caller | KeccakHash.ComputeHash (275K edges, 3 callees) |
| EVM dispatch | RunByteCode (45.5K edges, 239 callees) |

Profile Data - .jit.gz (Runtime Tier-1 JIT)

Stored compressed in repo and Docker image. Decompressed to nethermind.jit at image build time. The runtime reads it via DOTNET_ReadPGOData to seed the JIT's PGO data store, giving Tier-1 recompilation edge counts and GDV data from the first recompile without needing a warm-up period:

| Data Type | Coverage | Detail |
|---|---|---|
| Edge counts | 6,615 methods | 38,308 entries, 336M executions - branch prediction from first Tier-1 recompile |
| GDV type histograms | 3,284 methods | 9,092 sites: 592 monomorphic (direct devirt), 4,560 polymorphic (guarded devirt), 5,152 devirtualizable |
| Method histograms | 365 methods | 476 entries - delegate/interface dispatch optimization |
| Total methods | 7,239 | |

Upstream Issues Found & PRs

Other

Type of change

  • Performance improvement
  • New feature (PGO collection pipeline)

Test plan

  • PGO collection workflow runs end-to-end on self-hosted benchmark runner
  • Main trace produces 7,867 profiled methods with edge counts + GDV over 10K blocks
  • SPGO sampling produces 969K attributed CPU samples (from ~9.3M perf samples) across 1,361 methods
  • Call graph: 917K directed edges across 2,480 callers for CallFrequency method layout
  • CallWeights in .mibc: 12,365 edges across 2,791 methods for Pettis-Hansen fallback
  • Hot-cold splitting enabled using SPGO block counts (1,361 methods)
  • Merged .mibc profile (790KB) includes edge counts, SPGO blocks, GDV, and call graph weights
  • .callchain.json.gz (222KB) with directed caller-callee edges for crossgen2
  • .jit.gz runtime PGO data (7,239 methods, 5,152 devirtualizable GDV sites) compressed by PgoTrim
  • Profile stable across consecutive runs (within ~2% variance on all metrics)
  • NethermindPgoPatches.cs compiles cleanly into dotnet-pgo (verified locally, 0 warnings)
  • Verified crossgen2 receives all flags (from local binlog): --method-layout:callfrequency, --callchain-profile, --hot-cold-splitting, --opt-cross-module:*. pettishansen NOT present - correctly using CallFrequency.


Copilot AI left a comment

Pull request overview

This PR adjusts the PGO collection workflow to retain low-count methods in the runtime .jit profile data so guarded devirtualization (GDV) type histograms are preserved for more methods, improving JIT inlining opportunities during runtime PGO.

Changes:

  • Removes the effective trimming thresholds for .jit edge/block profile data by setting --min-block/--min-edge to 0.
  • Updates workflow messaging/comments to reflect that the .jit data is being compressed (and retained) rather than aggressively trimmed.

@benaadams benaadams changed the title from "perf: keep all PGO methods including low-count GDV data" to "perf: keep all PGO data and enable profile-driven inlining" Mar 19, 2026
@benaadams

@claude review this

@benaadams benaadams requested a review from Copilot March 19, 2026 17:50

@benaadams benaadams changed the title from "perf: keep all PGO data and enable profile-driven inlining" to "perf: maximize PGO impact and enable cross-module R2R inlining" Mar 19, 2026
@NethermindEth NethermindEth deleted a comment from github-actions bot Mar 19, 2026
@NethermindEth NethermindEth deleted a comment from github-actions bot Mar 19, 2026
@benaadams benaadams changed the title from "perf: maximize PGO impact and enable cross-module R2R inlining" to "perf: maximize PGO impact, cross-module inlining, and profile-guided method layout" Mar 19, 2026
benaadams and others added 3 commits March 22, 2026 23:29
RocksDB disposal hangs indefinitely on overlay filesystems (used by
EXPB for PGO collection), preventing WritePGOData from flushing the
.jit file. The process gets SIGKILL before reaching the PGO write.

Add 15s timeout on lifetimeScope.DisposeAsync() so shutdown proceeds
even if DB close hangs. This allows the runtime's ProcessExit handler
to flush PGO data.

TEMPORARY: revert once the snapshot disposal hang is investigated.
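The bounded-shutdown pattern in this commit can be sketched as below. The actual change is in C# around lifetimeScope.DisposeAsync(); this is only a Python analog of the same idea, and `dispose_with_timeout`, `hanging_dispose`, and `quick_dispose` are hypothetical names. The timeout is shortened from 15 s so the example runs quickly.

```python
import asyncio


async def dispose_with_timeout(dispose, timeout_seconds):
    """Await dispose(), but give up after timeout_seconds so shutdown can proceed."""
    try:
        await asyncio.wait_for(dispose(), timeout=timeout_seconds)
        return True
    except asyncio.TimeoutError:
        # Disposal hung; report it and let the caller continue shutting down.
        return False


async def hanging_dispose():
    await asyncio.sleep(3600)  # stand-in for a DB close that never returns


async def quick_dispose():
    await asyncio.sleep(0)  # stand-in for a normal, fast close


hung = asyncio.run(dispose_with_timeout(hanging_dispose, 0.05))
clean = asyncio.run(dispose_with_timeout(quick_dispose, 0.05))
print(hung, clean)
```

The point of returning instead of raising is that the rest of shutdown, including whatever flushes the PGO data, still gets a chance to run.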
@NethermindEth NethermindEth deleted a comment from github-actions bot Mar 23, 2026
@NethermindEth NethermindEth deleted a comment from github-actions bot Mar 23, 2026
…gression

Pettis-Hansen was a no-op (no CallWeights) during the benchmark run,
so the regression must be from TC delay, cross-module inlining, or
warmup changes. Remove TC_CallCountingDelayMs=30 (reverts to default
100ms) to test if this was the cause.
@benaadams benaadams changed the title from "perf: full PGO data, cross-module inlining, Pettis-Hansen method layout" to "perf: full PGO pipeline - SPGO, CallFrequency layout, hot-cold splitting, cross-module inlining" Mar 23, 2026
Pettis-Hansen uses an undirected call graph, losing caller-callee
directionality. CallFrequency preserves direction (places callees after
callers), which the Facebook hfsort paper showed gives 2x better IPC
improvement than PH.

New PgoTrim subcommand: generate-callchain
- Reads .callgraph (IP pairs) + .etlx (method map)
- Resolves IPs to method names via binary search on MethodMemoryMap
- Outputs CallChainProfile JSON for crossgen2 --callchain-profile
- Also writes .sizes file (method name, native size, exclusive samples)
  for potential CDS (Cache-Directed Sort) implementation

Directory.Build.targets:
- Uses --method-layout:callfrequency when callchain JSON exists
- Falls back to --method-layout:pettishansen when only .mibc available

Workflow:
- Generates callchain JSON in PgoTrim step
- Uploads/downloads/commits as additional PGO artifact
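To make the "callees after callers" idea concrete, here is a deliberately simple greedy sketch. It is not crossgen2's CallFrequency algorithm, which this PR does not describe in detail; the method names and weights are hypothetical, and the only property illustrated is that a directed edge decides which method comes first.

```python
def layout_by_call_frequency(edge_weights):
    """Greedy layout: visit edges heaviest first, emitting caller then callee."""
    order, placed = [], set()
    heaviest_first = sorted(
        edge_weights.items(), key=lambda item: item[1], reverse=True
    )
    for (caller, callee), _weight in heaviest_first:
        for method in (caller, callee):
            if method not in placed:
                placed.add(method)
                order.append(method)
    return order


# Hypothetical directed edges: (caller, callee) -> sample weight.
edges = {
    ("A", "B"): 900,
    ("B", "C"): 400,
    ("A", "D"): 50,
}
layout = layout_by_call_frequency(edges)
reversed_layout = layout_by_call_frequency({("B", "A"): 900})
print(layout, reversed_layout)
```

Reversing the direction of a single edge changes the output order, which is exactly the information an undirected graph cannot carry.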
crossgen2's --hot-cold-splitting flag (CORJIT_FLAG_PROCSPLIT) tells the
JIT to split R2R method bodies into hot and cold sections during AOT
compilation. Cold basic blocks (error paths, exception handlers, rare
branches) are moved to a separate .text.cold section.

This uses the SPGO block counts from the .mibc (1,361 methods with
per-block CPU sample attribution) to identify which blocks are cold.
The result is a smaller hot code working set with better I-cache density
- the .NET equivalent of BOLT's basic block reordering.
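As a rough illustration of the split described above, the sketch below partitions a method's basic blocks by sample count. The block names, counts, and the `cold_threshold` parameter are hypothetical, and the JIT's real criteria for deciding that a block is cold are not specified here.

```python
def split_hot_cold(block_counts, cold_threshold=0):
    """Partition basic blocks, preserving original order within each section.

    block_counts: list of (block_name, sample_count) pairs.
    cold_threshold: hypothetical cut-off; blocks at or below it go to the cold section.
    """
    hot = [name for name, count in block_counts if count > cold_threshold]
    cold = [name for name, count in block_counts if count <= cold_threshold]
    return hot, cold


# Hypothetical per-block sample counts for one method.
blocks = [
    ("entry", 1200),
    ("bounds_check_fail", 0),
    ("loop_body", 9800),
    ("throw_helper", 0),
    ("exit", 1200),
]
hot, cold = split_hot_cold(blocks)
print(hot, cold)
```

After the split, the hot section is contiguous and no longer interleaved with error paths, which is where the I-cache density benefit comes from.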
…t build time

Store as .gz in repo to reduce commit size. MSBuild target decompresses
before Publish so crossgen2 can read the JSON. Same pattern as .jit.gz.
benaadams and others added 10 commits March 23, 2026 03:02
…uild target

MSBuild PropertyGroup Exists() conditions are evaluated at load time,
before any targets run. The DecompressCallChainProfile target ran
BeforeTargets="Publish" which is too late - the Crossgen2ExtraCommandLineArgs
was already set to pettishansen by the time the .json was decompressed.

Fix: decompress in the Dockerfile RUN step before dotnet publish, so the
.json exists when MSBuild evaluates the PropertyGroup conditions. Remove
the MSBuild target since it's no longer needed.
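The ordering problem in this commit can be reproduced in miniature: a condition that checks for the decompressed file gives a different answer depending on whether it is evaluated before or after decompression. The sketch below is an analog in Python, not the MSBuild or Dockerfile code; the file name `profile.callchain.json` and the helper names are hypothetical.

```python
import gzip
import os
import shutil
import tempfile


def decompress(gz_path, out_path):
    """Write the decompressed contents of gz_path to out_path."""
    with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
        shutil.copyfileobj(src, dst)


def choose_method_layout(callchain_json_path):
    """Mirror the described fallback: CallFrequency only if the JSON exists."""
    if os.path.exists(callchain_json_path):
        return "--method-layout:callfrequency"
    return "--method-layout:pettishansen"


workdir = tempfile.mkdtemp()
gz_path = os.path.join(workdir, "profile.callchain.json.gz")  # hypothetical name
json_path = os.path.join(workdir, "profile.callchain.json")
with gzip.open(gz_path, "wb") as f:
    f.write(b"{}")

flag_before = choose_method_layout(json_path)  # evaluated too early
decompress(gz_path, json_path)
flag_after = choose_method_layout(json_path)  # evaluated after decompression
print(flag_before, flag_after)
```

Moving the decompression into a step that runs before the build starts is the equivalent of making sure only the second evaluation ever happens.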
…ry.Build.targets

PublishReadyToRunComposite and OptimizationPreference=Speed were only
in Runner.csproj. Moving them to Directory.Build.targets ensures any
project published with R2R (including BDN benchmarks using the R2R
toolchain) gets the same settings as production.
@github-actions

Block Processing Benchmark Comparison

Run: View workflow run
Base: 5a6c7795 | Head: 2ffe6b51

| Method | Base (us) | PR (us) | Delta | Base CV | PR CV | Alloc Base | Alloc PR | Alloc Delta |
|---|---|---|---|---|---|---|---|---|
| AccessList_50 | 761.4 | 754.9 | -0.9% | 1.0% | 2.8% | 73.8 KB | 73.7 KB | -0.1% |
| ContractCall_200 | 1,829.4 | 1,810.2 | -1.0% | 1.6% | 0.5% | 367.1 KB | 367.2 KB | +0.0% |
| ContractDeploy_10 | 555.8 | 567.9 | +2.2% | 2.9% | 0.7% | 54.1 KB | 51.8 KB | -4.2% |
| Eip1559_200 | 1,808.0 | 1,793.7 | -0.8% | 1.6% | 1.3% | 350.4 KB | 350.1 KB | -0.1% |
| EmptyBlock | 24.3 | 50.7 | +108.6% 🔼 | 15.0% | 71.3% | 7.0 KB | 7.0 KB | +0.0% |
| MixedBlock | 1,858.6 | 1,837.9 | -1.1% | 1.6% | 3.2% | 357.1 KB | 357.2 KB | +0.0% |
| SingleTransfer | 79.0 | 114.7 | +45.1% 🔼 | 2.1% | 36.0% | 18.6 KB | 18.6 KB | +0.0% |
| Transfers_200 | 1,832.8 | 1,806.8 | -1.4% | 1.3% | 1.3% | 350.1 KB | 350.3 KB | +0.1% |
| Transfers_50 | 775.9 | 766.1 | -1.3% | 1.6% | 1.5% | 65.2 KB | 65.3 KB | +0.2% |
Detailed statistics
| Method | Metric | Base | PR | Delta |
|---|---|---|---|---|
| AccessList_50 | Mean | 761.4 us | 754.9 us | -0.9% |
| AccessList_50 | Median | 760.8 us | 758.1 us | -0.3% |
| AccessList_50 | P90 | 769.0 us | 779.4 us | +1.4% |
| AccessList_50 | P95 | 770.7 us | 784.0 us | +1.7% |
| AccessList_50 | Min | 747.6 us | 726.0 us | -2.9% |
| AccessList_50 | Max | 772.4 us | 788.5 us | +2.1% |
| AccessList_50 | StdDev | 7.4 us | 20.9 us | +184.0% |
| ContractCall_200 | Mean | 1,829.4 us | 1,810.2 us | -1.0% |
| ContractCall_200 | Median | 1,827.1 us | 1,809.1 us | -1.0% |
| ContractCall_200 | P90 | 1,857.0 us | 1,819.9 us | -2.0% |
| ContractCall_200 | P95 | 1,875.3 us | 1,822.9 us | -2.8% |
| ContractCall_200 | Min | 1,799.1 us | 1,800.0 us | +0.1% |
| ContractCall_200 | Max | 1,893.5 us | 1,826.0 us | -3.6% |
| ContractCall_200 | StdDev | 28.5 us | 9.2 us | -67.6% |
| ContractDeploy_10 | Mean | 555.8 us | 567.9 us | +2.2% |
| ContractDeploy_10 | Median | 555.5 us | 568.5 us | +2.3% |
| ContractDeploy_10 | P90 | 576.6 us | 572.5 us | -0.7% |
| ContractDeploy_10 | P95 | 579.5 us | 572.7 us | -1.2% |
| ContractDeploy_10 | Min | 530.7 us | 561.7 us | +5.8% |
| ContractDeploy_10 | Max | 582.3 us | 572.9 us | -1.6% |
| ContractDeploy_10 | StdDev | 15.9 us | 4.3 us | -73.3% |
| Eip1559_200 | Mean | 1,808.0 us | 1,793.7 us | -0.8% |
| Eip1559_200 | Median | 1,810.5 us | 1,798.1 us | -0.7% |
| Eip1559_200 | P90 | 1,836.1 us | 1,816.5 us | -1.1% |
| Eip1559_200 | P95 | 1,840.5 us | 1,827.6 us | -0.7% |
| Eip1559_200 | Min | 1,757.0 us | 1,760.6 us | +0.2% |
| Eip1559_200 | Max | 1,844.9 us | 1,838.8 us | -0.3% |
| Eip1559_200 | StdDev | 28.8 us | 24.2 us | -16.1% |
| EmptyBlock | Mean | 24.3 us | 50.7 us | +108.6% |
| EmptyBlock | Median | 25.1 us | 22.8 us | -9.0% |
| EmptyBlock | P90 | 27.6 us | 88.8 us | +221.1% |
| EmptyBlock | P95 | 27.7 us | 89.3 us | +222.8% |
| EmptyBlock | Min | 17.0 us | 16.7 us | -2.1% |
| EmptyBlock | Max | 27.7 us | 89.9 us | +224.5% |
| EmptyBlock | StdDev | 3.7 us | 36.2 us | +890.3% |
| MixedBlock | Mean | 1,858.6 us | 1,837.9 us | -1.1% |
| MixedBlock | Median | 1,857.5 us | 1,845.4 us | -0.6% |
| MixedBlock | P90 | 1,885.2 us | 1,909.7 us | +1.3% |
| MixedBlock | P95 | 1,893.6 us | 1,911.3 us | +0.9% |
| MixedBlock | Min | 1,796.2 us | 1,732.5 us | -3.5% |
| MixedBlock | Max | 1,902.1 us | 1,912.9 us | +0.6% |
| MixedBlock | StdDev | 29.6 us | 58.2 us | +96.4% |
| SingleTransfer | Mean | 79.0 us | 114.7 us | +45.1% |
| SingleTransfer | Median | 78.6 us | 88.3 us | +12.4% |
| SingleTransfer | P90 | 80.6 us | 165.0 us | +104.8% |
| SingleTransfer | P95 | 81.5 us | 167.6 us | +105.8% |
| SingleTransfer | Min | 76.5 us | 76.2 us | -0.4% |
| SingleTransfer | Max | 82.3 us | 170.2 us | +106.7% |
| SingleTransfer | StdDev | 1.6 us | 41.3 us | +2423.8% |
| Transfers_200 | Mean | 1,832.8 us | 1,806.8 us | -1.4% |
| Transfers_200 | Median | 1,832.2 us | 1,802.1 us | -1.6% |
| Transfers_200 | P90 | 1,859.3 us | 1,827.5 us | -1.7% |
| Transfers_200 | P95 | 1,862.0 us | 1,841.0 us | -1.1% |
| Transfers_200 | Min | 1,788.7 us | 1,780.8 us | -0.4% |
| Transfers_200 | Max | 1,864.7 us | 1,854.5 us | -0.5% |
| Transfers_200 | StdDev | 23.7 us | 22.7 us | -4.3% |
| Transfers_50 | Mean | 775.9 us | 766.1 us | -1.3% |
| Transfers_50 | Median | 774.4 us | 766.5 us | -1.0% |
| Transfers_50 | P90 | 790.8 us | 777.2 us | -1.7% |
| Transfers_50 | P95 | 794.0 us | 781.9 us | -1.5% |
| Transfers_50 | Min | 758.1 us | 752.9 us | -0.7% |
| Transfers_50 | Max | 797.2 us | 786.5 us | -1.3% |
| Transfers_50 | StdDev | 12.3 us | 11.3 us | -8.4% |
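For reading the Delta column, the sketch below assumes it is the relative change of PR against Base; the report does not state the formula, so this is an inference from the numbers. Recomputing from the rounded values shown can differ from the published delta in the last digit, since the report presumably works from unrounded measurements.

```python
def percent_delta(base, pr):
    """Relative change of pr against base, in percent (assumed definition)."""
    return (pr - base) / base * 100.0


# EmptyBlock mean, using the rounded values from the table (microseconds).
empty_block_mean = percent_delta(24.3, 50.7)
print(f"{empty_block_mean:+.1f}%")
```

For EmptyBlock the mean roughly doubles while the median and minimum move slightly down and StdDev grows sharply, which suggests a few slow outlier iterations rather than a uniform slowdown.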

@github-actions

EXPB Benchmark Comparison

Run: View workflow run

superblocks

Scenario: nethermind-flat-superblocks-pgo-2-delay0s

| Metric | PR | Master (cached) | Delta PR vs Master |
|---|---|---|---|
| AVG (ms) | 1075.658200 | 1001.388100 | +7.42% |
| MEDIAN (ms) | 949.400000 | 873.505000 | +8.69% |
| P90 (ms) | 1583.12 | 1532.22 | +3.32% |
| P95 (ms) | 1717.53 | 1867.30 | -8.02% |
| P99 (ms) | 3090.95 | 2386.11 | +29.54% |
| MIN (ms) | 650.93 | 663.77 | -1.93% |
| MAX (ms) | 3409.33 | 2940.23 | +15.95% |

realblocks

Scenario: nethermind-flat-realblocks-pgo-2-delay0s

| Metric | PR | Master (cached) | Delta PR vs Master |
|---|---|---|---|
| AVG (ms) | 28.491386 | 26.483828 | +7.58% |
| MEDIAN (ms) | 23.970000 | 22.445000 | +6.79% |
| P90 (ms) | 42.27 | 39.40 | +7.28% |
| P95 (ms) | 52.78 | 49.35 | +6.95% |
| P99 (ms) | 118.11 | 112.14 | +5.32% |
| MIN (ms) | 0.84 | 1.21 | -30.58% |
| MAX (ms) | 3357.05 | 1269.48 | +164.44% |
