Background
Nethermind already has a dedicated accelerated path for the Prague-era BLS12-381 G1 MSM precompile.
Relevant code:
src/Nethermind/Nethermind.Evm.Precompiles/zkevm/Bls12381G1MsmPrecompile.cs:22-24
chooses Mul(...) for a single item and Msm(...) for multiple items.
src/Nethermind/Nethermind.Evm.Precompiles/zkevm/Bls12381G1MsmPrecompile.cs:61
calls Accelerators.Bls12381G1Msm(decoded, (nuint)pairCount, output).
src/Nethermind/Nethermind.Precompiles.Benchmark/Bls12381G1MsmBenchmark.cs:9-13
already provides a dedicated benchmark entrypoint for this precompile.
Problem
The current accelerated path does not make it obvious where end-to-end time is actually spent.
The Msm(...) path has three distinct stages:
- input decode and layout rewrite into the trimmed internal representation
- Accelerators.Bls12381G1Msm(...)
- output re-encoding into the EVM return shape
Relevant code:
src/Nethermind/Nethermind.Evm.Precompiles/zkevm/Bls12381G1MsmPrecompile.cs:54-63
src/Nethermind/Nethermind.Evm.Precompiles/zkevm/Bls12381G1MsmPrecompile.cs:84-103
src/Nethermind/Nethermind.Evm.Precompiles/zkevm/Bls12381G1MsmPrecompile.cs:107-115
src/Nethermind/Nethermind.Evm.Precompiles/Eip2537.zkevm.cs:61-69
src/Nethermind/Nethermind.Evm.Precompiles/Eip2537.zkevm.cs:14-18
Without stage-level attribution it is hard to tell whether the next optimization should target:
- decode and buffer preparation
- the accelerator boundary itself
- some batch-size threshold effect between the two
Why this matters
src/Nethermind/Nethermind.Evm.Test/Eip2537Tests.cs:87-98
verifies the G1 MSM precompile is enabled after Prague.
src/Nethermind/Nethermind.Evm.Test/Bls12381G1MsmPrecompileTests.cs:10-19
shows there is already dedicated vector-based coverage for this path.
There is also precedent for performance work in this area:
76801d5915 optimisations and cleanup concurrent g1 msm
5e87830335 start implementing concurrent decoding for msm
9b16f46e01 finish concurrent msm decoding
Desired outcome
Extend the existing benchmark coverage around Bls12381G1MsmBenchmark so that we can measure:
- full precompile runtime
- decode and layout cost
- accelerator compute cost
- encode cost
The useful result here would be a repeatable breakdown across representative pairCount sizes, so the next optimization can target the actual hotspot instead of guessing.
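As a rough illustration of the breakdown described above, here is a minimal sketch that times each stage separately across a few pairCount values. It is not the existing Bls12381G1MsmBenchmark and does not call any Nethermind code: the stage bodies are placeholder copies, the MeanNs helper and the chosen pairCount values are assumptions made for illustration, and only the 160-byte pair size and 128-byte output size come from EIP-2537. A real version would presumably wrap the actual decode, Accelerators.Bls12381G1Msm(...), and encode steps, and would likely live in the existing benchmark project rather than use a hand-rolled timer.

```csharp
using System;
using System.Diagnostics;

const int PairSize = 160;   // EIP-2537 G1 MSM input: 128-byte point + 32-byte scalar per pair
const int OutputSize = 128; // EIP-2537 encoded G1 point

int[] pairCounts = { 2, 16, 128 }; // illustrative sizes only (assumption)

foreach (int pairCount in pairCounts)
{
    byte[] input = new byte[pairCount * PairSize];
    byte[] decoded = new byte[input.Length]; // placeholder for the trimmed internal layout
    byte[] raw = new byte[OutputSize];
    byte[] output = new byte[OutputSize];

    // Hypothetical stand-ins for the three stages. Replace each body with the
    // real decode, accelerator call, and encode step to get meaningful numbers.
    Action decode = () => input.CopyTo(decoded, 0);
    Action compute = () => Array.Clear(raw, 0, raw.Length);
    Action encode = () => raw.CopyTo(output, 0);
    Action full = () => { decode(); compute(); encode(); };

    Console.WriteLine(
        $"pairCount={pairCount} " +
        $"full={MeanNs(full, 1000):F0}ns " +
        $"decode={MeanNs(decode, 1000):F0}ns " +
        $"compute={MeanNs(compute, 1000):F0}ns " +
        $"encode={MeanNs(encode, 1000):F0}ns");
}

// Mean wall-clock time per call in nanoseconds, after one warm-up call.
static double MeanNs(Action stage, int iterations)
{
    stage(); // warm-up so first-call JIT cost is not attributed to the stage
    long start = Stopwatch.GetTimestamp();
    for (int i = 0; i < iterations; i++) stage();
    long elapsedTicks = Stopwatch.GetTimestamp() - start;
    return elapsedTicks * (1_000_000_000.0 / Stopwatch.Frequency) / iterations;
}
```

Timing the full path alongside the individual stages makes it possible to check that the per-stage numbers roughly add up to the end-to-end figure. A large gap would suggest overhead that none of the three stages accounts for.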