feat(specs,tests): Update all benchmark tests to use `gas_benchmark_value` by marioevz · Pull Request #1935 · ethereum/execution-spec-tests

marioevz · 2025-07-23T00:38:24Z

🗒️ Description

Add check for gas used in blockchain and state tests

blockchain_test and state_test now have a flag called expected_benchmark_gas_used which can be used to specify the expected gas used by the last payload (the benchmark payload).

If the value is not specified, the test is expected to use the value specified by --gas-benchmark-values entirely.

Modified all benchmark tests to use `gas_benchmark_value`

All benchmark tests now use gas_benchmark_value, and also expected_benchmark_gas_used if the test is not expected to use the entire gas (e.g. when the tx is not expected to run out-of-gas and instead consume a specific amount of gas).

🔗 Related Issues or PRs

N/A.

✅ Checklist

All: Ran fast tox checks to avoid unnecessary CI fails, see also Code Standards and Enabling Pre-commit Checks:
```
uvx --with=tox-uv tox -e lint,typecheck,spellcheck,markdownlint
```
All: PR title adheres to the repo standard - it will be used as the squash commit message and should start type(scope):.
All: Considered adding an entry to CHANGELOG.md.
All: Considered updating the online docs in the ./docs/ directory.
All: Set appropriate labels for the changes (only maintainers can apply labels).

tests/benchmark/test_worst_compute.py

tests/benchmark/test_worst_stateful_opcodes.py

jsign · 2025-07-24T22:29:56Z

~~I'll take a look at this PR by tomorrow!~~

OK, pushed it for today :)

marioevz · 2025-07-24T22:51:37Z

Updated the branch with the following changes:

Fix test_empty_block so it no longer requests for unused fixture.
Improve test_worst_storage_access_cold to add cases where it (1) completes execution and SSTOREs affect the state, (2) reverts after SSTOREs, (3) runs out of gas and also reverts SSTOREs.
Fixes to some tests' environments and bumped the maximum environment gas limit, and now every single test can be run from a SINGLE GENESIS FILE 🎉

(ethereum-execution-spec-tests) ➜  execution-spec-tests git:(update-all-benchmark-tests) uv run groupstats ./fixtures/blockchain_tests_engine_x/pre_alloc -vv

Pre-Allocation Statistics Summary
Total groups: 1
Total tests: 3271
Total accounts: 64033

Tests and Accounts per Group
┏━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┳━━━━━━━━━━┓
┃ Group Hash  ┃ Fork   ┃ Tests ┃ Accounts ┃
┡━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━╇━━━━━━━━━━┩
│ 0x313d28... │ Prague │  3271 │    64033 │
└─────────────┴────────┴───────┴──────────┘

No need to restart clients anymore to perform every single benchmark we have 😉

cc @jsign @LouisTsai-Csie

jsign

Very nice change! Left some opinions for your consideration.

Lateral question: is there a way to compare the generated fixtures of this branch against master to see if there are only changes in the expected tests that had bugs/improvements? (e.g., not sure if there's some produced "hash" of each fixture output that could be compared to double check).

I'm a paranoid person, so for such a big refactoring I'm always fearing something falling through the cracks.

If it sounds like a rabbit hole, dismiss.

tests/benchmark/test_worst_bytecode.py

tests/benchmark/test_worst_compute.py

tests/benchmark/test_worst_memory.py

tests/benchmark/test_worst_bytecode.py

tests/benchmark/test_worst_stateful_opcodes.py

…an run from a single genesis

… if not specified

…nment

marioevz · 2025-07-25T21:42:28Z

Very nice change! Left some opinions for your consideration.

Lateral question: is there a way to compare the generated fixtures of this branch against master to see if there are only changes in the expected tests that had bugs/improvements? (e.g., not sure if there's some produced "hash" of each fixture output that could be compared to double check).

I'm a paranoid person, so for such a big refactoring I'm always fearing something falling through the cracks.

If it sounds like a rabbit hole, dismiss.

I think I've got your solution with this evmone new feature: ipsilon/evmone#1274

I could give it a try to (1) test it out for pawel, (2) verify that the tests produce similar opcode numbers before and after this PR using the same gas limits.

jsign · 2025-07-25T21:53:24Z

Very nice change! Left some opinions for your consideration.
Lateral question: is there a way to compare the generated fixtures of this branch against master to see if there are only changes in the expected tests that had bugs/improvements? (e.g., not sure if there's some produced "hash" of each fixture output that could be compared to double check).
I'm a paranoid person, so for such a big refactoring I'm always fearing something falling through the cracks.
If it sounds like a rabbit hole, dismiss.

I think I've got your solution with this evmone new feature: ipsilon/evmone#1274

I could give it a try to (1) test it out for pawel, (2) verify that the tests produce similar opcode numbers before and after this PR using the same gas limits.

Makes sense, since actually most "compute" test might not have a visible effect in block header fields such as state root and similar (used gas is a red herring) -- so the opcode count actually sounds better to achieve that. For the "stateful" ones, checking the block header might be enough.

marioevz · 2025-07-25T23:09:02Z

Generated the fixtures from master and from this PR with the opcode counts produced by ipsilon/evmone#1274:

benchmark-fixtures-with-opcode-count.tar.gz

Each fixture in every fixture file now contains a opcode_count element in the _info object, and it contains the number of times every opcode was executed during the test generation.

I have not made the comparison because the name changes make it tricky, but will do it next week.

marioevz · 2025-07-26T17:25:41Z

I gpt'd a script to get the differences (which could definitely be included in the PR to support opcode counts!) and got the results:

https://gist.github.com/marioevz/b745761e0882549c59ab9195b6865386

jsign · 2025-07-26T17:42:03Z

I gpt'd a script to get the differences (which could definitely be included in the PR to support opcode counts!) and got the results:

https://gist.github.com/marioevz/b745761e0882549c59ab9195b6865386

Looks awesome!

marioevz · 2025-07-28T19:23:30Z

Merging since I believe @LouisTsai-Csie comments were addressed 👍

…alue` (ethereum#1935) * feat(specs): Check gas used by benchmark tests matches expectation * feat(tests): Update all benchmark tests to use `gas_benchmark_value` * fix: tox * fix(plugins/benchmarking): Increase GIGA_GAS * fix(tests): Environment usages * docs: Changelog * fix(tox): tests-deployed-benchmark command line * fix: Remove unused parameter * fix: `test_worst_storage_access_cold` and add more cases * fix: `test_worst_bytecode_single_opcode` incorrect env overwrite * fix(plugins): Bump `BENCHMARKING_MAX_GAS` so all benchmarking tests can run from a single genesis * feat(plugins): Pass `env` fixture to the test spec builder by default if not specified * fix(benchmark): Remove `env` from tests where it's unused * fix(benchmark): tests referencing the `fee_recipient` from the environment * fix(plugins): Set benchmark environment based on test mark * fix: minor fix * fix(plugins): Defaults for `env` and `genesis_environment`

marioevz requested review from LouisTsai-Csie and jsign July 23, 2025 00:38

marioevz added type:feat type: Feature feature:benchmark labels Jul 23, 2025

marioevz marked this pull request as ready for review July 23, 2025 02:11

LouisTsai-Csie requested changes Jul 23, 2025

View reviewed changes

tests/benchmark/test_worst_compute.py Outdated Show resolved Hide resolved

tests/benchmark/test_worst_stateful_opcodes.py Show resolved Hide resolved

LouisTsai-Csie reviewed Jul 23, 2025

View reviewed changes

tests/benchmark/test_worst_stateful_opcodes.py Outdated Show resolved Hide resolved

marioevz force-pushed the update-all-benchmark-tests branch from 0e966d4 to 8f1795a Compare July 24, 2025 22:45

jsign approved these changes Jul 24, 2025

View reviewed changes

marioevz added 15 commits July 25, 2025 17:44

feat(specs): Check gas used by benchmark tests matches expectation

678f309

feat(tests): Update all benchmark tests to use gas_benchmark_value

9fb28ab

fix: tox

141fd59

fix(plugins/benchmarking): Increase GIGA_GAS

8f786e3

fix(tests): Environment usages

7095858

docs: Changelog

eb31eae

fix(tox): tests-deployed-benchmark command line

3b12ca5

fix: Remove unused parameter

84834bd

fix: test_worst_storage_access_cold and add more cases

d71cf58

fix: test_worst_bytecode_single_opcode incorrect env overwrite

4d1e550

fix(plugins): Bump BENCHMARKING_MAX_GAS so all benchmarking tests c…

91191dd

…an run from a single genesis

feat(plugins): Pass env fixture to the test spec builder by default…

987101e

… if not specified

fix(benchmark): Remove env from tests where it's unused

956acf8

fix(benchmark): tests referencing the fee_recipient from the enviro…

e10e725

…nment

fix(plugins): Set benchmark environment based on test mark

1fca4d9

marioevz force-pushed the update-all-benchmark-tests branch from 8f1795a to 1fca4d9 Compare July 25, 2025 20:56

fix: minor fix

ecb5c5e

marioevz mentioned this pull request Jul 26, 2025

feat(t8n): Support evmone's --opcode.count option #1956

Merged

8 tasks

fix(plugins): Defaults for env and genesis_environment

48bcea5

marioevz merged commit 377e175 into main Jul 28, 2025
16 checks passed

marioevz deleted the update-all-benchmark-tests branch July 28, 2025 19:23

Conversation

marioevz commented Jul 23, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🗒️ Description

Add check for gas used in blockchain and state tests

Modified all benchmark tests to use gas_benchmark_value

🔗 Related Issues or PRs

✅ Checklist

Uh oh!

Uh oh!

Uh oh!

Uh oh!

jsign commented Jul 24, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

marioevz commented Jul 24, 2025

Uh oh!

jsign left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

marioevz commented Jul 25, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jsign commented Jul 25, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

marioevz commented Jul 25, 2025

Uh oh!

marioevz commented Jul 26, 2025

Uh oh!

jsign commented Jul 26, 2025

Uh oh!

marioevz commented Jul 28, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

marioevz commented Jul 23, 2025 •

edited

Loading

Modified all benchmark tests to use `gas_benchmark_value`

jsign commented Jul 24, 2025 •

edited

Loading

marioevz commented Jul 25, 2025 •

edited

Loading

jsign commented Jul 25, 2025 •

edited

Loading