vm/benchmark: add EVM performance benchmarks targeting mainnet bottlenecks #19932
yperbasis
left a comment
From Claude:
Issues
- SSTORE benchmarks measure wrong state transitions after warmup (high)
All three SSTORE sub-benchmarks have a warmup call before b.Loop(). This modifies state, and Prepare only resets the access list — not dirty storage. So every measured iteration operates on already-mutated
state:
- zero-to-nonzero: Warmup writes 0xBEEF to all 100 slots. Every b.Loop() iteration then writes 0xBEEF to slots already containing 0xBEEF, which is a no-op SSTORE (100 gas), not zero-to-nonzero (20K gas). 0% of measured iterations test what the name says.
- nonzero-to-nonzero: Warmup overwrites 1000→2000. Subsequent iterations write 2000→2000, again a no-op SSTORE.
- nonzero-to-zero: Warmup clears slots. Subsequent iterations write 0→0 — zero-to-zero, not nonzero-to-zero.
Fix: recreate state each iteration inside b.Loop(), or at minimum remove the warmup for these linear benchmarks.
- BenchmarkSLOADCold and BenchmarkStorageDiversity have the same problem (medium)
These are also linear (no inner loop), with a warmup call. After warmup, the access list is reset by Prepare, so SLOADs are cold again — that part is fine. But the "cold" designation also affects SSTORE
benchmarks grouped nearby, and a reader might assume the pattern is consistent. More importantly, the warmup call consumes the one-shot gas budget and may OOG, silently returning an error. Since these don't
loop internally, the warmup is unnecessary — just remove it.
- Unused code (low — will fail lint)
- callContract in helpers.go:92-94 — defined but never called (all benchmarks use prepareAndCall)
- addrEOA in helpers.go:16 — defined but never referenced
- _ bool parameter in deployCallChain (bench_call_chain_test.go:294) — dead parameter
- Name helpers are verbose and have bad defaults (nit)
depthName, layerName, slotName, batchName, gasName, sizeName are all hand-written switch statements. depthName(32) returns "depth-N" instead of "depth-32". Replace with fmt.Sprintf:
func depthName(d int) string { return fmt.Sprintf("depth-%d", d) }
- Errors silently discarded on all calls (low)
Every prepareAndCall result is suppressed with //nolint:errcheck. For the gas-until-OOG benchmarks this is intentional (OOG is an error). But for the linear benchmarks with computed gas limits (BenchmarkSSTORE,
BenchmarkSLOADCold, BenchmarkStorageDiversity, BenchmarkERC20BatchTransfers), an unexpected OOG would silently produce garbage results. At minimum, check the warmup call:
_, _, err := prepareAndCall(cfg, addrContract, nil)
require.NoError(b, err)
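One way to keep intentional out-of-gas results separate from unexpected failures is a small classification helper. The sketch below is illustrative only: `errOutOfGas` and `checkRunErr` are hypothetical names, and the sentinel stands in for whatever error value the VM package actually returns on OOG.

```go
package main

import (
	"errors"
	"fmt"
)

// errOutOfGas is a hypothetical stand-in for the VM's out-of-gas sentinel.
var errOutOfGas = errors.New("out of gas")

// checkRunErr reports whether a benchmark call result is acceptable.
// oogExpected is true for benchmarks that intentionally loop until gas runs out.
func checkRunErr(err error, oogExpected bool) error {
	switch {
	case err == nil:
		return nil
	case oogExpected && errors.Is(err, errOutOfGas):
		return nil // running out of gas is the designed termination
	default:
		return fmt.Errorf("unexpected benchmark error: %w", err)
	}
}

func main() {
	fmt.Println(checkRunErr(nil, false))         // linear benchmark, success
	fmt.Println(checkRunErr(errOutOfGas, true))  // looping benchmark, expected OOG
	fmt.Println(checkRunErr(errOutOfGas, false)) // linear benchmark, gas limit too low
}
```

In a real benchmark the returned error would be passed to `require.NoError(b, ...)`, so a miscomputed gas limit fails loudly instead of producing misleading timings.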
Minor observations
- Token contract in deployDeFiContracts will underflow after ~5000 loop iterations (500000 / 100). Doesn't affect benchmarking but is cosmetically wrong.
- The README is well-written and provides good context for future developers.
- Using NewWithVersionMap to mirror real parallel execution overhead is a good choice.
- All APIs verified against codebase — types, signatures, and patterns match correctly.
Verdict
The benchmarks fill a real gap (existing Engine X suite covers precompiles but misses DeFi call chains, storage diversity, and compound patterns). The main issue is the SSTORE benchmarks are measuring the wrong
thing — they need state reset between iterations. The unused code will likely fail make lint. Everything else is minor.
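The SSTORE issue raised in this review can be illustrated with a toy model. The sketch below is not the benchmark code: `classify` and `run` are hypothetical helpers, a single `uint64` stands in for a storage slot, and the model deliberately ignores the "original value" bookkeeping that real SSTORE gas metering uses.

```go
package main

import "fmt"

// classify names the transition produced by writing next over current.
// A write that leaves the value unchanged (including 0 -> 0) is a no-op.
func classify(current, next uint64) string {
	switch {
	case current == next:
		return "no-op"
	case current == 0:
		return "zero-to-nonzero"
	case next == 0:
		return "nonzero-to-zero"
	default:
		return "nonzero-to-nonzero"
	}
}

// run writes value into the slot n times WITHOUT resetting state in between
// and returns the transition observed on each write.
func run(slot *uint64, value uint64, n int) []string {
	out := make([]string, 0, n)
	for i := 0; i < n; i++ {
		out = append(out, classify(*slot, value))
		*slot = value
	}
	return out
}

func main() {
	var slot uint64 // starts at zero
	// The first write plays the role of the warmup call; the others are "measured".
	fmt.Println(run(&slot, 0xBEEF, 3))
}
```

Only the first write sees the intended transition; every later one is a no-op, which is why the state has to be restored between iterations.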
yperbasis
left a comment
From Claude:
Bug: Stack leak in two benchmarks
BenchmarkStackOps and BenchmarkMixedCompute have a net +1 stack item per loop iteration, causing a stack overflow at ~1024 iterations. This makes them terminate in ~0.15ms instead of using their 100M gas budget
(~168ms for equivalent benchmarks). They're measuring EVM setup overhead, not opcode dispatch.
StackOps (bench_interpreter_test.go:427-432): The loop body pushes 1 value + 8 DUPs but only has 8 POPs. Needs 9 POPs (or remove Push(0x42) from inside the loop):
Push(0x42) // +1
DUP1..DUP8 // +8
SWAP1..SWAP4 // +0
POP×8 // -8
Jump // +0
// Net: +1 per iteration → overflow at ~1024
MixedCompute (bench_interpreter_test.go:509-521): Same issue — the arithmetic section produces 1 value, stack ops add 4 via DUPs, memory ops consume some, but cleanup only does 4 POPs. Net +1 per iteration.
Confirmed empirically:
BenchmarkPureArithmetic/add/100M 168ms ← correct (uses full gas)
BenchmarkStackOps/dup-swap/100M 0.15ms ← 1000x too fast (stack overflow)
BenchmarkMixedCompute/mixed/100M 0.15ms ← 1000x too fast (stack overflow)
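The StackOps accounting above can be checked mechanically. The following is a minimal sketch, assuming each DUP nets +1, each SWAP nets 0, and the `Jump` helper expands to a destination push followed by JUMP; the `op` type and function names are hypothetical and only net stack effects are modeled.

```go
package main

import "fmt"

// stackLimit is the EVM's maximum stack depth.
const stackLimit = 1024

// op records only the net stack effect of an opcode (pushes minus pops).
type op struct {
	name string
	net  int
}

// repeat returns n copies of o.
func repeat(o op, n int) []op {
	out := make([]op, n)
	for i := range out {
		out[i] = o
	}
	return out
}

// stackOpsBody rebuilds the quoted loop body with a configurable POP count.
func stackOpsBody(pops int) []op {
	body := []op{{"PUSH", +1}}                          // Push(0x42)
	body = append(body, repeat(op{"DUP", +1}, 8)...)    // DUP1..DUP8
	body = append(body, repeat(op{"SWAP", 0}, 4)...)    // SWAP1..SWAP4
	body = append(body, repeat(op{"POP", -1}, pops)...) // cleanup
	body = append(body, op{"PUSH", +1}, op{"JUMP", -1}) // assumed expansion of Jump
	return body
}

// netDelta returns the change in stack height after one pass over body.
func netDelta(body []op) int {
	d := 0
	for _, o := range body {
		d += o.net
	}
	return d
}

// itersToFill returns how many iterations a leaking loop needs to fill the
// stack, or -1 if the loop does not grow the stack.
func itersToFill(delta int) int {
	if delta <= 0 {
		return -1
	}
	return stackLimit / delta
}

func main() {
	for _, pops := range []int{8, 9} {
		d := netDelta(stackOpsBody(pops))
		fmt.Printf("POPs=%d net=%+d itersToFill=%d\n", pops, d, itersToFill(d))
	}
}
```

With 8 POPs the body leaks one item per pass, so the stack fills after about 1024 passes; with 9 POPs the body is balanced.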
Minor issues
- Dead code in BenchmarkCallWithValue/with-value (bench_call_chain_test.go:236): The first deployContractWithBalance(statedb, addrContract, nil, ...) is immediately overwritten by the second call on line 240. Remove it.
- DeFi swap balance underflow: The token contracts subtract 100 from slot 0 each loop iteration. After ~5000 inner iterations (within a single OOG call), the from balance hits 0 and wraps around to a large uint256. Subsequent SSTOREs become zero-to-nonzero transitions (20K gas instead of 5K), changing the gas cost profile mid-measurement. Consider using snapshot/revert like the SSTORE benchmarks do, or giving tokens a much larger starting balance.
- makeAddrs limit (bench_call_chain_test.go:284): raw[19] = byte(i + 1) wraps at 255 addresses. Fine for current usage (max 16), but a comment noting the limit would help.
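The makeAddrs limit can be demonstrated in isolation. This is a sketch under assumptions: only the `raw[19] = byte(i + 1)` assignment comes from the review; the `address` type, `makeAddrsByte`, `makeAddrsWide`, and `distinct` are hypothetical names, and the two-byte variant is just one possible way to lift the limit.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

type address [20]byte

// makeAddrsByte mirrors the single-byte scheme: only the last byte varies.
func makeAddrsByte(n int) []address {
	addrs := make([]address, n)
	for i := range addrs {
		addrs[i][19] = byte(i + 1) // truncates to 8 bits
	}
	return addrs
}

// makeAddrsWide is a hypothetical variant that spreads the index over the
// last two bytes, giving 65535 distinct non-zero addresses.
func makeAddrsWide(n int) []address {
	addrs := make([]address, n)
	for i := range addrs {
		binary.BigEndian.PutUint16(addrs[i][18:], uint16(i+1))
	}
	return addrs
}

// distinct counts the unique addresses in addrs.
func distinct(addrs []address) int {
	seen := make(map[address]struct{}, len(addrs))
	for _, a := range addrs {
		seen[a] = struct{}{}
	}
	return len(seen)
}

func main() {
	fmt.Println(distinct(makeAddrsByte(255)), distinct(makeAddrsByte(300)))
	fmt.Println(distinct(makeAddrsWide(300)))
}
```

In the single-byte scheme, indices 0 through 254 yield last bytes 1 through 255; index 255 wraps to the zero address, and index 256 collides with index 0.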
Verdict
The architecture and most benchmarks are solid. Fix the two stack-leak bugs — they're currently measuring nothing useful.
…necks
Based on analysis of 50 mainnet blocks (14,886 txs, 1.53B gas) and bloatnet comparison, these benchmarks target the actual hot paths in real block execution:
- Call chains (68.7% of mainnet gas): nested STATICCALL/DELEGATECALL, DeFi swap
- Storage access (6% of DeFi gas): cold/warm SLOAD, SSTORE transitions, transient
- Token transfers (16.7% of mainnet gas): ERC-20 transfer/transferFrom patterns
- Interpreter loop: arithmetic, stack, memory, keccak256 dispatch overhead
All benchmarks use versionedio (NewWithVersionMap) to match real parallel execution overhead. Profiling shows ~1M allocs/100M gas dominated by versionedRead/versionWritten tracking (28%), journal revert (23%), and state object storage maps (34%).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Fix SSTORE benchmarks measuring wrong state transitions: use PushSnapshot/RevertToSnapshot to restore storage between iterations, ensuring each iteration measures the intended transition (zero-to-nonzero, nonzero-to-nonzero, nonzero-to-zero)
- Fix SLOADCold and StorageDiversity benchmarks: same snapshot/revert pattern ensures slots are cold each iteration
- Fix BatchTransfers: snapshot/revert prevents cumulative state mutation
- Remove unused code: callContract helper, addrEOA, dead bool parameter in deployCallChain
- Simplify name helpers: replace verbose switch statements with fmt.Sprintf (depthName, layerName, slotName, batchName, gasName, sizeName)
- Add explicit OOG comments on errcheck suppressions for looping benchmarks that intentionally run until out-of-gas
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
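The snapshot/revert pattern named in this commit can be sketched with a toy journal. This is not the state database API: `toyState`, `Snapshot`, and `Revert` are hypothetical stand-ins for the PushSnapshot/RevertToSnapshot calls, and storage is reduced to a plain map.

```go
package main

import "fmt"

// journalEntry remembers a slot's previous value so a write can be undone.
type journalEntry struct {
	slot uint64
	prev uint64
	had  bool // whether the slot existed before the write
}

// toyState is a hypothetical stand-in for a journaled state database.
type toyState struct {
	storage map[uint64]uint64
	journal []journalEntry
}

func newToyState() *toyState {
	return &toyState{storage: map[uint64]uint64{}}
}

// Set writes val to slot and journals the old value.
func (s *toyState) Set(slot, val uint64) {
	prev, had := s.storage[slot]
	s.journal = append(s.journal, journalEntry{slot, prev, had})
	s.storage[slot] = val
}

// Snapshot returns an identifier for the current journal position.
func (s *toyState) Snapshot() int { return len(s.journal) }

// Revert undoes every write made after the snapshot, newest first.
func (s *toyState) Revert(id int) {
	for i := len(s.journal) - 1; i >= id; i-- {
		e := s.journal[i]
		if e.had {
			s.storage[e.slot] = e.prev
		} else {
			delete(s.storage, e.slot)
		}
	}
	s.journal = s.journal[:id]
}

func main() {
	st := newToyState()
	for iter := 0; iter < 3; iter++ {
		snap := st.Snapshot()
		before := st.storage[0]
		st.Set(0, 0xBEEF) // the "measured" write
		fmt.Printf("iter %d: %#x -> %#x\n", iter, before, st.storage[0])
		st.Revert(snap) // restore so the next iteration starts from zero again
	}
}
```

One trade-off of this design: when the revert runs inside the timed loop, its cost is part of every measured iteration, so reported numbers include journal unwinding as well as the write itself.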
yperbasis
left a comment
Still broken: Stack leaks (from review round 2)
BenchmarkStackOps (bench_interpreter_test.go:427-432) — net +1 per iteration:
Push(0x42) // +1
DUP1..DUP8 // +8
SWAP1..SWAP4 // ±0
POP ×8 // -8
// Net: +1 → overflow at ~1024 iterations
Needs 9 POPs or move Push(0x42) outside the loop (before JUMPDEST).
BenchmarkMixedCompute (bench_interpreter_test.go:509-521) — net +1 per iteration:
Push(42) Push(17) ADD Push(3) MUL // +1 (produces 1 value)
DUP1 DUP2 SWAP1 SWAP2 // +2
DUP1 DUP2 SWAP1 // +2
Push(0) MSTORE // -1 (push 1, pop 2)
Push(0) MLOAD // +1 (push 1; MLOAD replaces the top item)
POP POP POP POP // -4
// 1 + 2 + 2 - 1 + 1 - 4 = +1
Step by step, starting from an empty stack on the first iteration:
- Arithmetic: Push Push ADD Push MUL → 1 item
- Stack ops: DUP1 DUP2 → 3, SWAP1 SWAP2 → 3, DUP1 DUP2 → 5, SWAP1 → 5
- Memory: Push(0) → 6, MSTORE → 4, Push(0) → 5, MLOAD → 5
- Cleanup: POP×4 → 1
Net +1 per iteration. Overflows at ~1024. Both benchmarks complete in ~0.15ms instead of the expected ~168ms — they're measuring EVM startup, not opcode dispatch.
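The step-by-step recount above can be reproduced with a running-height trace that also checks operand availability. This is a sketch, not the benchmark code: the `step` type and helper names are hypothetical, and the closing jump is omitted because it is net zero.

```go
package main

import "fmt"

// step is one opcode: need is the minimum stack height it requires,
// delta is its net effect on the height.
type step struct {
	name  string
	need  int
	delta int
}

// mixedBody rebuilds the MixedCompute loop body with a configurable POP count.
func mixedBody(pops int) []step {
	push := step{"PUSH", 0, +1}
	binop := func(name string) step { return step{name, 2, -1} }
	dup := func(n int) step { return step{fmt.Sprintf("DUP%d", n), n, +1} }
	swap := func(n int) step { return step{fmt.Sprintf("SWAP%d", n), n + 1, 0} }
	body := []step{
		push, push, binop("ADD"), push, binop("MUL"), // arithmetic
		dup(1), dup(2), swap(1), swap(2), // stack ops
		dup(1), dup(2), swap(1),
		push, {"MSTORE", 2, -2}, // memory write
		push, {"MLOAD", 1, 0}, // memory read
	}
	for i := 0; i < pops; i++ {
		body = append(body, step{"POP", 1, -1})
	}
	return body
}

// finalHeight runs body from an empty stack and returns the resulting
// height, or an error if any step lacks operands.
func finalHeight(body []step) (int, error) {
	h := 0
	for i, s := range body {
		if h < s.need {
			return h, fmt.Errorf("step %d (%s): stack underflow at height %d", i, s.name, h)
		}
		h += s.delta
	}
	return h, nil
}

func main() {
	for _, pops := range []int{4, 5, 6} {
		h, err := finalHeight(mixedBody(pops))
		fmt.Printf("POPs=%d height=%d err=%v\n", pops, h, err)
	}
}
```

Four POPs leave one item behind per pass, five balance the body, and a sixth would underflow, so five is the only count that works here.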
Still present: Dead code in BenchmarkCallWithValue/with-value
bench_call_chain_test.go:236:
deployContractWithBalance(statedb, addrContract, nil, uint256.NewInt(1_000_000_000))
// ... immediately overwritten on line 240:
deployContractWithBalance(statedb, addrContract, code, uint256.NewInt(1_000_000_000))
First call is dead.
Still present: Token balance underflow in looping benchmarks
BenchmarkERC20Transfer, BenchmarkERC20TransferFrom, and BenchmarkDeFiSwapChain subtract 100 from a from-balance each inner loop iteration. Starting at 1,000,000 (or 500,000 for DeFi), balance hits zero after
10,000 (5,000) iterations. With 100M gas, the loop runs ~50K-200K iterations. After underflow, the uint256 wraps and subsequent SSTOREs become zero-to-nonzero (20K gas) for one iteration, shifting the gas
profile. Practically negligible but cosmetically wrong — use a much larger starting balance or snapshot/revert.
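The underflow arithmetic can be checked with a short calculation. The sketch below uses `math/big` to emulate 256-bit wraparound rather than the uint256 library the benchmarks use; `itersUntilZero` and `balanceAfter` are hypothetical helper names, and the starting balances are the ones quoted in this thread.

```go
package main

import (
	"fmt"
	"math/big"
)

// itersUntilZero returns how many fixed-size debits empty the balance.
func itersUntilZero(start, amount uint64) uint64 {
	return start / amount
}

// balanceAfter returns (start - amount*iters) mod 2^256, matching the
// wraparound behavior of EVM SUB on 256-bit words.
func balanceAfter(start, amount, iters uint64) *big.Int {
	mod := new(big.Int).Lsh(big.NewInt(1), 256)
	spent := new(big.Int).Mul(new(big.Int).SetUint64(amount), new(big.Int).SetUint64(iters))
	r := new(big.Int).Sub(new(big.Int).SetUint64(start), spent)
	return r.Mod(r, mod) // Mod is Euclidean, so the result is non-negative
}

func main() {
	for _, start := range []uint64{500_000, 1_000_000, 1_000_000_000} {
		fmt.Printf("start=%d: zero after %d iterations\n", start, itersUntilZero(start, 100))
	}
	// One debit past zero wraps to a 256-bit value just below 2^256.
	fmt.Println(balanceAfter(500_000, 100, 5001).BitLen())
}
```

A starting balance of 1B lasts 10,000,000 debits of 100, well above the ~50K-200K iterations a 100M gas budget allows according to the estimate above.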
Issues from review 1 that appear fixed
- SSTORE benchmarks now use PushSnapshot/RevertToSnapshot without warmup calls
- BenchmarkSLOADCold and BenchmarkStorageDiversity now use snapshot/revert
- Dead code (callContract, addrEOA, dead bool param) removed
- BenchmarkStackOps: add 9th POP to balance PUSH+DUP8 (was +1/iter → overflow at 1024)
- BenchmarkMixedCompute: add 5th POP to balance full stack (was +1/iter → overflow at 1024)
- BenchmarkCallWithValue: remove dead deployContractWithBalance(nil code) call
- DeFi token contracts: increase starting balance from 500K to 1B to prevent uint256 underflow
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…necks (#19932)

## Summary
- Adds a new `execution/vm/benchmark/` package with targeted EVM benchmarks based on analysis of 50 mainnet blocks (14,886 txs, 1.53B gas) and bloatnet comparison
- Benchmarks cover the actual hot paths: call chains (68.7% gas), storage access, token transfer patterns, and interpreter dispatch
- All benchmarks use `versionedio` (NewWithVersionMap) to match real parallel execution overhead

### Benchmark suites
| Suite | What it measures |
|-------|-----------------|
| `BenchmarkCallChain` | Nested STATICCALL/DELEGATECALL, DeFi swap patterns |
| `BenchmarkStorage` | Cold/warm SLOAD, SSTORE transitions, transient storage |
| `BenchmarkTokenTransfer` | ERC-20 transfer/transferFrom patterns |
| `BenchmarkInterpreter` | Arithmetic, stack, memory, keccak256 dispatch |

## Test plan
- [x] `go test -run='^$' -bench=. ./execution/vm/benchmark/` compiles and runs
- [ ] CI passes

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>