
vm/benchmark: add EVM performance benchmarks targeting mainnet bottlenecks #19932

Merged: mh0lt merged 4 commits into main from evm-benchmarks on Mar 17, 2026

@mh0lt mh0lt commented Mar 16, 2026

Contributor

Summary

  • Adds a new execution/vm/benchmark/ package with targeted EVM benchmarks based on analysis of 50 mainnet blocks (14,886 txs, 1.53B gas) and bloatnet comparison
  • Benchmarks cover the actual hot paths: call chains (68.7% gas), storage access, token transfer patterns, and interpreter dispatch
  • All benchmarks use versionedio (NewWithVersionMap) to match real parallel execution overhead

Benchmark suites

| Suite | What it measures |
|-------|------------------|
| `BenchmarkCallChain` | Nested STATICCALL/DELEGATECALL, DeFi swap patterns |
| `BenchmarkStorage` | Cold/warm SLOAD, SSTORE transitions, transient storage |
| `BenchmarkTokenTransfer` | ERC-20 transfer/transferFrom patterns |
| `BenchmarkInterpreter` | Arithmetic, stack, memory, keccak256 dispatch |

Test plan

  • go test -run='^$' -bench=. ./execution/vm/benchmark/ compiles and runs
  • CI passes

🤖 Generated with Claude Code

@yperbasis yperbasis left a comment

Member

From Claude:

Issues

  1. SSTORE benchmarks measure wrong state transitions after warmup (high)

All three SSTORE sub-benchmarks have a warmup call before b.Loop(). This modifies state, and Prepare only resets the access list — not dirty storage. So every measured iteration operates on already-mutated
state:

  • zero-to-nonzero: Warmup writes 0xBEEF to all 100 slots. Every b.Loop() iteration then writes 0xBEEF to slots already containing 0xBEEF — a no-op SSTORE (100 gas), not zero-to-nonzero (20K gas). 0% of measured
    iterations test what the name says.
  • nonzero-to-nonzero: Warmup overwrites 1000→2000. Subsequent iterations write 2000→2000 — again no-op SSTORE.
  • nonzero-to-zero: Warmup clears slots. Subsequent iterations write 0→0 — zero-to-zero, not nonzero-to-zero.

Fix: recreate state each iteration inside b.Loop(), or at minimum remove the warmup for these linear benchmarks.
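
The effect described above can be reproduced with a toy model. The sketch below is only an illustration: `toyState`, `snapshot`, `revert` and `sstore` are hypothetical names invented here, not Erigon's state API. It shows that without a reset only the first write is a zero-to-nonzero transition, while reverting to a snapshot after each iteration keeps every write on the intended transition.

```go
package main

import "fmt"

// toyState is a hypothetical stand-in for a state database: a slot map plus
// a stack of snapshots. It is not Erigon's API.
type toyState struct {
	slots     map[int]uint64
	snapshots []map[int]uint64
}

func newToyState() *toyState { return &toyState{slots: map[int]uint64{}} }

// snapshot saves a copy of the current storage and returns its id.
func (s *toyState) snapshot() int {
	cp := make(map[int]uint64, len(s.slots))
	for k, v := range s.slots {
		cp[k] = v
	}
	s.snapshots = append(s.snapshots, cp)
	return len(s.snapshots) - 1
}

// revert restores storage to the given snapshot and drops later snapshots.
func (s *toyState) revert(id int) {
	s.slots = s.snapshots[id]
	s.snapshots = s.snapshots[:id]
}

// sstore writes val to slot and reports which transition the write was.
func (s *toyState) sstore(slot int, val uint64) string {
	old := s.slots[slot]
	s.slots[slot] = val
	switch {
	case old == val:
		return "no-op"
	case old == 0:
		return "zero-to-nonzero"
	case val == 0:
		return "nonzero-to-zero"
	default:
		return "nonzero-to-nonzero"
	}
}

// measure performs iters writes of 0xBEEF to slot 0 and returns the transition
// seen on each iteration. With reset=true, state is reverted after every write.
func measure(iters int, reset bool) []string {
	st := newToyState()
	var seen []string
	for i := 0; i < iters; i++ {
		id := st.snapshot()
		seen = append(seen, st.sstore(0, 0xBEEF))
		if reset {
			st.revert(id)
		}
	}
	return seen
}

func main() {
	fmt.Println("without reset:", measure(3, false))
	fmt.Println("with reset:   ", measure(3, true))
}
```

In a real benchmark the snapshot and revert would sit inside the measured loop, so their cost becomes part of the measurement; that trade-off is the price of measuring the named transition at all.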

  2. BenchmarkSLOADCold and BenchmarkStorageDiversity have the same problem (medium)

These are also linear (no inner loop), with a warmup call. After warmup, the access list is reset by Prepare, so SLOADs are cold again — that part is fine. But the "cold" designation also affects SSTORE
benchmarks grouped nearby, and a reader might assume the pattern is consistent. More importantly, the warmup call consumes the one-shot gas budget and may OOG, silently returning an error. Since these don't
loop internally, the warmup is unnecessary — just remove it.

  3. Unused code (low; will fail lint)

  • callContract in helpers.go:92-94, defined but never called (all benchmarks use prepareAndCall)
  • addrEOA in helpers.go:16, defined but never referenced
  • _ bool parameter in deployCallChain (bench_call_chain_test.go:294), a dead parameter

  4. Name helpers are verbose and have bad defaults (nit)

depthName, layerName, slotName, batchName, gasName, sizeName are all hand-written switch statements. depthName(32) returns "depth-N" instead of "depth-32". Replace with fmt.Sprintf:

func depthName(d int) string { return fmt.Sprintf("depth-%d", d) }

  5. Errors silently discarded on all calls (low)

Every prepareAndCall result is suppressed with //nolint:errcheck. For the gas-until-OOG benchmarks this is intentional (OOG is an error). But for the linear benchmarks with computed gas limits (BenchmarkSSTORE,
BenchmarkSLOADCold, BenchmarkStorageDiversity, BenchmarkERC20BatchTransfers), an unexpected OOG would silently produce garbage results. At minimum, check the warmup call:

_, _, err := prepareAndCall(cfg, addrContract, nil)
require.NoError(b, err)

Minor observations

  • Token contract in deployDeFiContracts will underflow after ~5000 loop iterations (500000 / 100). Doesn't affect benchmarking but is cosmetically wrong.
  • The README is well-written and provides good context for future developers.
  • Using NewWithVersionMap to mirror real parallel execution overhead is a good choice.
  • All APIs verified against codebase — types, signatures, and patterns match correctly.

Verdict

The benchmarks fill a real gap (existing Engine X suite covers precompiles but misses DeFi call chains, storage diversity, and compound patterns). The main issue is the SSTORE benchmarks are measuring the wrong
thing — they need state reset between iterations. The unused code will likely fail make lint. Everything else is minor.

@yperbasis yperbasis left a comment

Member

From Claude:

Bug: Stack leak in two benchmarks

BenchmarkStackOps and BenchmarkMixedCompute have a net +1 stack item per loop iteration, causing a stack overflow at ~1024 iterations. This makes them terminate in ~0.15ms instead of using their 100M gas budget
(~168ms for equivalent benchmarks). They're measuring EVM setup overhead, not opcode dispatch.

StackOps (bench_interpreter_test.go:427-432): The loop body pushes 1 value + 8 DUPs but only has 8 POPs. Needs 9 POPs (or remove Push(0x42) from inside the loop):
Push(0x42) // +1
DUP1..DUP8 // +8
SWAP1..SWAP4 // +0
POP×8 // -8
Jump // +0
// Net: +1 per iteration → overflow at ~1024

MixedCompute (bench_interpreter_test.go:509-521): Same issue. The arithmetic section leaves 1 value, the stack ops add 4 via DUPs, the memory ops are net zero (Push(0) MSTORE removes one item, Push(0) MLOAD adds one), and cleanup does only 4 POPs. Net +1 per iteration.

Confirmed empirically:
BenchmarkPureArithmetic/add/100M 168ms ← correct (uses full gas)
BenchmarkStackOps/dup-swap/100M 0.15ms ← 1000x too fast (stack overflow)
BenchmarkMixedCompute/mixed/100M 0.15ms ← 1000x too fast (stack overflow)
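
The stack accounting for the StackOps loop body can be checked mechanically. The sketch below models each instruction by how many items it pops and pushes, then simulates the loop against the 1024-item stack limit. It is an illustration only: the `op` type and helper names are invented here, and modelling the back-jump as a PUSH of the destination followed by JUMP is an assumption about what the `Jump` helper emits.

```go
package main

import "fmt"

const stackLimit = 1024 // EVM stack depth limit

// op describes how many items an instruction pops and pushes.
type op struct {
	name         string
	pops, pushes int
}

func push() op      { return op{"PUSH", 0, 1} }
func pop() op       { return op{"POP", 1, 0} }
func dup(n int) op  { return op{fmt.Sprintf("DUP%d", n), n, n + 1} }
func swap(n int) op { return op{fmt.Sprintf("SWAP%d", n), n + 1, n + 1} }
func jump() op      { return op{"JUMP", 1, 0} }

// stackOpsBody builds the loop body from the review: one PUSH, DUP1..DUP8,
// SWAP1..SWAP4, then numPops POPs. The back-jump is modelled as PUSH dest + JUMP
// (an assumption, see the lead-in).
func stackOpsBody(numPops int) []op {
	body := []op{push()}
	for n := 1; n <= 8; n++ {
		body = append(body, dup(n))
	}
	for n := 1; n <= 4; n++ {
		body = append(body, swap(n))
	}
	for i := 0; i < numPops; i++ {
		body = append(body, pop())
	}
	return append(body, push(), jump())
}

// netDelta returns the change in stack depth after one pass over body.
func netDelta(body []op) int {
	d := 0
	for _, o := range body {
		d += o.pushes - o.pops
	}
	return d
}

// iterationsUntilFault runs body repeatedly from an empty stack and returns how
// many full iterations complete before a stack overflow or underflow, capped at
// maxIters.
func iterationsUntilFault(body []op, maxIters int) int {
	depth := 0
	for i := 0; i < maxIters; i++ {
		for _, o := range body {
			if depth < o.pops || depth-o.pops+o.pushes > stackLimit {
				return i
			}
			depth += o.pushes - o.pops
		}
	}
	return maxIters
}

func main() {
	for _, pops := range []int{8, 9} {
		body := stackOpsBody(pops)
		fmt.Printf("POPs=%d net=%+d iterations=%d\n",
			pops, netDelta(body), iterationsUntilFault(body, 100000))
	}
}
```

In this model the 8-POP body faults after 1016 completed iterations, because the stack peaks 9 items above its starting depth inside each pass; that is consistent with the "~1024" figure above. The 9-POP body is balanced and runs to the iteration cap.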

Minor issues

  1. Dead code in BenchmarkCallWithValue/with-value (bench_call_chain_test.go:236): The first deployContractWithBalance(statedb, addrContract, nil, ...) is immediately overwritten by the second call on line 240.
    Remove it.
  2. DeFi swap balance underflow: The token contracts subtract 100 from slot 0 each loop iteration. After ~5000 inner iterations (within a single OOG call), the from balance hits 0 and wraps around to a large
    uint256. Subsequent SSTOREs become zero-to-nonzero transitions (20K gas instead of 5K), changing the gas cost profile mid-measurement. Consider using snapshot/revert like the SSTORE benchmarks do, or giving
    tokens a much larger starting balance.
  3. makeAddrs limit (bench_call_chain_test.go:284): raw[19] = byte(i + 1) wraps at 255 addresses. Fine for current usage (max 16), but a comment noting the limit would help.
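
The address limit in item 3 follows from byte truncation. The sketch below is a hypothetical reconstruction of the helper based only on the quoted line `raw[19] = byte(i + 1)`; the real `makeAddrs` may differ in signature and in how the other 19 bytes are set.

```go
package main

import "fmt"

// makeAddrs is a hypothetical reconstruction of the helper: it varies only the
// last byte of a 20-byte address, as in raw[19] = byte(i + 1).
func makeAddrs(n int) [][20]byte {
	addrs := make([][20]byte, n)
	for i := range addrs {
		addrs[i][19] = byte(i + 1) // keeps only the low 8 bits of i+1
	}
	return addrs
}

// distinct counts the unique addresses in addrs.
func distinct(addrs [][20]byte) int {
	seen := make(map[[20]byte]struct{}, len(addrs))
	for _, a := range addrs {
		seen[a] = struct{}{}
	}
	return len(seen)
}

func main() {
	fmt.Println(distinct(makeAddrs(16)))  // current maximum usage
	fmt.Println(distinct(makeAddrs(255))) // last size with unique non-zero addresses
	a := makeAddrs(257)
	fmt.Println(a[255][19], a[256] == a[0]) // index 255 is the zero address; 256 collides with 0
}
```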

Verdict

The architecture and most benchmarks are solid. Fix the two stack-leak bugs — they're currently measuring nothing useful.

mh0lt and others added 3 commits March 16, 2026 23:19
…necks

Based on analysis of 50 mainnet blocks (14,886 txs, 1.53B gas) and bloatnet
comparison, these benchmarks target the actual hot paths in real block execution:

- Call chains (68.7% of mainnet gas): nested STATICCALL/DELEGATECALL, DeFi swap
- Storage access (6% of DeFi gas): cold/warm SLOAD, SSTORE transitions, transient
- Token transfers (16.7% of mainnet gas): ERC-20 transfer/transferFrom patterns
- Interpreter loop: arithmetic, stack, memory, keccak256 dispatch overhead

All benchmarks use versionedio (NewWithVersionMap) to match real parallel
execution overhead. Profiling shows ~1M allocs/100M gas dominated by
versionedRead/versionWritten tracking (28%), journal revert (23%),
and state object storage maps (34%).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Fix SSTORE benchmarks measuring wrong state transitions: use
  PushSnapshot/RevertToSnapshot to restore storage between iterations,
  ensuring each iteration measures the intended transition (zero-to-nonzero,
  nonzero-to-nonzero, nonzero-to-zero)
- Fix SLOADCold and StorageDiversity benchmarks: same snapshot/revert
  pattern ensures slots are cold each iteration
- Fix BatchTransfers: snapshot/revert prevents cumulative state mutation
- Remove unused code: callContract helper, addrEOA, dead bool parameter
  in deployCallChain
- Simplify name helpers: replace verbose switch statements with
  fmt.Sprintf (depthName, layerName, slotName, batchName, gasName,
  sizeName)
- Add explicit OOG comments on errcheck suppressions for looping
  benchmarks that intentionally run until out-of-gas

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@yperbasis yperbasis left a comment

Member

Still broken: Stack leaks (from review round 2)

BenchmarkStackOps (bench_interpreter_test.go:427-432) — net +1 per iteration:

Push(0x42) // +1
DUP1..DUP8 // +8
SWAP1..SWAP4 // ±0
POP ×8 // -8
// Net: +1 → overflow at ~1024 iterations

Needs 9 POPs or move Push(0x42) outside the loop (before JUMPDEST).

BenchmarkMixedCompute (bench_interpreter_test.go:509-521) — net +1 per iteration:

Push(42) Push(17) ADD Push(3) MUL // +1 (leaves one result)
DUP1 DUP2 SWAP1 SWAP2 // +2
DUP1 DUP2 SWAP1 // +2
Push(0) MSTORE // -1 (pushes one, pops two)
Push(0) MLOAD // +1 (pushes one; MLOAD replaces the offset with the loaded word)
POP POP POP POP // -4
// 1 + 2 + 2 - 1 + 1 - 4 = +1

Stack depth step by step, starting from an empty stack each iteration:

  • Arithmetic: Push Push ADD Push MUL → 1 item
  • Stack ops: DUP1 DUP2 → 3, SWAP1 SWAP2 → 3, DUP1 DUP2 → 5, SWAP1 → 5
  • Memory: Push(0) → 6, MSTORE → 4, Push(0) → 5, MLOAD → 5
  • Cleanup: POP×4 → 1

Net +1 per iteration. Overflows at ~1024. Both benchmarks complete in ~0.15ms instead of the expected ~168ms — they're measuring EVM startup, not opcode dispatch.

Still present: Dead code in BenchmarkCallWithValue/with-value

bench_call_chain_test.go:236:
deployContractWithBalance(statedb, addrContract, nil, uint256.NewInt(1_000_000_000))
// ... immediately overwritten on line 240:
deployContractWithBalance(statedb, addrContract, code, uint256.NewInt(1_000_000_000))

First call is dead.

Still present: Token balance underflow in looping benchmarks

BenchmarkERC20Transfer, BenchmarkERC20TransferFrom, and BenchmarkDeFiSwapChain subtract 100 from a from-balance each inner loop iteration. Starting at 1,000,000 (or 500,000 for DeFi), balance hits zero after
10,000 (5,000) iterations. With 100M gas, the loop runs ~50K-200K iterations. After underflow, the uint256 wraps and subsequent SSTOREs become zero-to-nonzero (20K gas) for one iteration, shifting the gas
profile. Practically negligible but cosmetically wrong — use a much larger starting balance or snapshot/revert.

Issues from review 1 that appear fixed

  • SSTORE benchmarks now use PushSnapshot/RevertToSnapshot without warmup calls
  • BenchmarkSLOADCold and BenchmarkStorageDiversity now use snapshot/revert
  • Dead code (callContract, addrEOA, dead bool param) removed

@yperbasis yperbasis added this to the 3.5.0 milestone Mar 17, 2026
- BenchmarkStackOps: add 9th POP to balance PUSH+DUP8 (was +1/iter → overflow at 1024)
- BenchmarkMixedCompute: add 5th POP to balance full stack (was +1/iter → overflow at 1024)
- BenchmarkCallWithValue: remove dead deployContractWithBalance(nil code) call
- DeFi token contracts: increase starting balance from 500K to 1B to prevent uint256 underflow

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@mh0lt mh0lt merged commit a7c972b into main Mar 17, 2026
37 checks passed
@mh0lt mh0lt deleted the evm-benchmarks branch March 17, 2026 12:40
lupin012 pushed a commit that referenced this pull request Mar 17, 2026
…necks (#19932)

AskAlexSharov pushed a commit that referenced this pull request Mar 18, 2026
…necks (#19932)

AskAlexSharov pushed a commit that referenced this pull request Mar 18, 2026
…necks (#19932)
