Skip to content

perf: Optimize NEXT_INST macro by reordering memory loads for better performance#465

Merged
quake merged 1 commit intonervosnetwork:developfrom
quake:quake/next-inst-opt
Mar 14, 2025
Merged

perf: Optimize NEXT_INST macro by reordering memory loads for better performance#465
quake merged 1 commit intonervosnetwork:developfrom
quake:quake/next-inst-opt

Conversation

@quake
Copy link
Member

@quake quake commented Mar 10, 2025

A simple instruction reordering in the NEXT_INST macro may reduce memory access latency and improve instruction-level parallelism. This change improves performance by approximately 2% on my env (2 diferenet spec cpu) because of hot path.

CPU1

interpret secp256k1_bench via assembly
                        time:   [4.6390 ms 4.6434 ms 4.6492 ms]
                        change: [-2.7022% -2.5134% -2.3102%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  4 (4.00%) high mild
  4 (4.00%) high severe

interpret secp256k1_bench via assembly mop
                        time:   [4.7361 ms 4.7442 ms 4.7534 ms]
                        change: [-2.0556% -1.7441% -1.4555%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  5 (5.00%) high mild
  3 (3.00%) high severe

interpret secp256k1_bench via assembly mop (memoized decoder)
                        time:   [3.8854 ms 3.9218 ms 3.9775 ms]
                        change: [-1.7528% -0.8066% +0.7752%] (p = 0.21 > 0.05)
                        No change in performance detected.
Found 9 outliers among 100 measurements (9.00%)
  5 (5.00%) high mild
  4 (4.00%) high severe

interpret secp256k1_bench via assembly mop (memoized dynamic length decoder)
                        time:   [3.2486 ms 3.2510 ms 3.2537 ms]
                        change: [-2.6563% -2.4866% -2.3305%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 14 outliers among 100 measurements (14.00%)
  5 (5.00%) high mild
  9 (9.00%) high severe

CPU2

interpret secp256k1_bench via assembly
                        time:   [3.4991 ms 3.5017 ms 3.5046 ms]
                        change: [-2.0495% -1.8489% -1.6756%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 16 outliers among 100 measurements (16.00%)
  2 (2.00%) high mild
  14 (14.00%) high severe

interpret secp256k1_bench via assembly mop
                        time:   [3.4559 ms 3.4607 ms 3.4677 ms]
                        change: [-1.5877% -1.4303% -1.2230%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 13 outliers among 100 measurements (13.00%)
  1 (1.00%) high mild
  12 (12.00%) high severe

interpret secp256k1_bench via assembly mop (memoized decoder)
                        time:   [2.9029 ms 2.9051 ms 2.9077 ms]
                        change: [-1.6066% -1.5125% -1.4196%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  1 (1.00%) high mild
  8 (8.00%) high severe

interpret secp256k1_bench via assembly mop (memoized dynamic length decoder)
                        time:   [2.4625 ms 2.4633 ms 2.4642 ms]
                        change: [-2.3027% -2.2052% -2.1223%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) high mild
  3 (3.00%) high severe

Copy link
Collaborator

@xxuejie xxuejie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I remembered the code was written as the changes after this PR in the first place. Later in profiling, it was revealed that memory load from INST_ARGS and INST_PC cause enough cache misses, so the instructions are shuffled. Not sure if it is a good idea to move those instructions back again. I can only confirm that the correctness of the code has no issues.

@mohanson
Copy link
Collaborator

Confirmed that the performance has been significantly improved.

interpret secp256k1_bench via assembly
                        time:   [4.5208 ms 4.5343 ms 4.5484 ms]
                        change: [-1.9489% -1.4873% -1.0056%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

interpret secp256k1_bench via assembly mop
                        time:   [4.4476 ms 4.4651 ms 4.4878 ms]
                        change: [-2.4842% -1.9465% -1.3296%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) high mild
  2 (2.00%) high severe

interpret secp256k1_bench via assembly mop (memoized decoder)
                        time:   [3.7501 ms 3.7695 ms 3.7943 ms]
                        change: [-4.0137% -3.3920% -2.7504%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high severe

interpret secp256k1_bench via assembly mop (memoized dynamic length decoder)
                        time:   [3.1980 ms 3.2061 ms 3.2149 ms]
                        change: [-2.9800% -2.3961% -1.8728%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe

@quake quake merged commit a1d4a9e into nervosnetwork:develop Mar 14, 2025
16 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants