
Avoid using YMM registers in gcm_ghash_vpclmulqdq_avx2_1. #2470

Merged: 1 commit merged into main from b/xmm on Mar 11, 2025

Conversation

briansmith (Owner)

[Adapted by Brian Smith to ring's prior modifications.]

gcm_ghash_vpclmulqdq_avx2() executes vzeroupper only when len >= 32, as it was supposed to use only xmm registers for len == 16. But actually it was writing to two ymm registers unconditionally for $BSWAP_MASK and $GFPOLY. Therefore, there could be a slow-down in later code using legacy SSE instructions, in the (probably rare) case where gcm_ghash_vpclmulqdq_avx2() was called with len=16 and wasn't followed by a function that does vzeroupper, e.g. aes_gcm_enc_update_vaes_avx2().
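
For illustration, here is a minimal GAS/AT&T-syntax sketch of the problematic shape. The function name, the register assumed to carry len, and the labels and constants are assumptions made for the sketch; this is not *ring*'s actual generated code.

```asm
# Hypothetical sketch of the pre-fix shape; illustrative only.
        .text
        .globl  ghash_sketch_before
ghash_sketch_before:
        # Both loads write a full 256-bit YMM register, dirtying the upper
        # 128 bits even on the len == 16 path that never needs them.
        vbroadcasti128  .Lbswap_mask(%rip), %ymm0       # $BSWAP_MASK
        vbroadcasti128  .Lgfpoly(%rip), %ymm1           # $GFPOLY
        cmpq    $32, %rcx                               # assume len is in %rcx
        jb      .Lreturn
        # ... multi-block processing using YMM registers ...
        vzeroupper                                      # only reached when len >= 32
.Lreturn:
        # The len == 16 path returns with dirty YMM upper halves, so later
        # legacy-SSE code (e.g. a Windows xmm-restore epilogue) can pay an
        # AVX-to-SSE transition penalty.
        ret

        .section .rodata
        .p2align 4
.Lbswap_mask:
        .byte   15,14,13,12,11,10,9,8,7,6,5,4,3,2,1,0   # byte-reversal shuffle mask
.Lgfpoly:
        .quad   1, 0xc200000000000000                   # GHASH reduction constant
```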

(The Windows xmm register restore epilogue of gcm_ghash_vpclmulqdq_avx2() itself does use legacy SSE instructions, so it probably was slowed down by this, but that's only 8 instructions.)

Fix this by updating gcm_ghash_vpclmulqdq_avx2() to correctly use only xmm registers when len=16. This makes it more closely match the similar code in aes-gcm-avx10-x86_64.pl, which does this correctly.

Also, make both functions simply execute vzeroupper unconditionally, so that it won't be missed again. It costs only 1 cycle on the CPUs this code runs on, and it no longer seems worth executing it conditionally.
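
A corresponding sketch of the fixed shape, under the same caveats (hypothetical names, registers, and labels): the len == 16 path stays entirely in XMM registers, and vzeroupper runs unconditionally on the way out.

```asm
# Hypothetical sketch of the post-fix shape; illustrative only.
        .text
        .globl  ghash_sketch_after
ghash_sketch_after:
        cmpq    $32, %rcx                       # assume len is in %rcx
        jae     .Lwide
        # len == 16: load only the low 128 bits; no YMM register is written.
        vmovdqu .Lbswap_mask(%rip), %xmm0       # $BSWAP_MASK
        vmovdqu .Lgfpoly(%rip), %xmm1           # $GFPOLY
        # ... single-block GHASH using XMM registers only ...
        jmp     .Lout
.Lwide:
        vbroadcasti128  .Lbswap_mask(%rip), %ymm0
        vbroadcasti128  .Lgfpoly(%rip), %ymm1
        # ... multi-block processing using YMM registers ...
.Lout:
        vzeroupper                              # now unconditional; about 1 cycle
        ret

        .section .rodata
        .p2align 4
.Lbswap_mask:
        .byte   15,14,13,12,11,10,9,8,7,6,5,4,3,2,1,0   # byte-reversal shuffle mask
.Lgfpoly:
        .quad   1, 0xc200000000000000                   # GHASH reduction constant
```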

briansmith changed the title from "Fix missing vzeroupper in gcm_ghash_vpclmulqdq_avx2() for len=16" to "Avoid using YMM registers in gcm_ghash_vpclmulqdq_avx2_1." on Mar 10, 2025

codecov bot commented Mar 10, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 96.62%. Comparing base (3542fbc) to head (eebf4a6).
Report is 1 commit behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #2470   +/-   ##
=======================================
  Coverage   96.61%   96.62%           
=======================================
  Files         180      180           
  Lines       21814    21814           
  Branches      539      539           
=======================================
+ Hits        21076    21077    +1     
  Misses        623      623           
+ Partials      115      114    -1     


briansmith (Owner, Author)

@arielb1 @jlizen This may conflict with the thing you're doing.

arielb1 (Contributor) commented Mar 11, 2025

@briansmith

My PR works fine on top of the b/xmm branch

briansmith (Owner, Author) commented Mar 11, 2025

I benchmarked the changes to gcm_ghash_vpclmulqdq_avx2_1 with and without the vzeroupper. In both cases, the new version that avoids YMM registers is much, much faster (according to bench/aead.rs), particularly for small inputs.

Mini postmortem: Of course, when reviewing the original code, I was curious about whether the vzeroupper was needed. But the wrong conclusion came from a combination of:

  • upstream accepting that it wasn't necessary for this case,
  • a brain-fart of misunderstanding the "i128" in vbroadcasti128 (see the sketch after this list),
  • misunderstanding comments from the author about the code being slower than some other alternatives for small sizes, and
  • upstream not using this function for partial blocks like we do.
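
For context on the vbroadcasti128 point above, a small illustrative sketch (not taken from the code; the function name and pointer register are assumptions): the "i128" refers to the width of the 128-bit memory source, not the destination. The instruction duplicates that source into both lanes of a YMM register, so it always writes all 256 bits, whereas a VEX-encoded 128-bit load into an XMM register zeroes bits 255:128 of the corresponding YMM register.

```asm
# Illustrative comparison of destination widths (GAS/AT&T syntax).
        .text
        .globl  broadcast_vs_load_sketch
broadcast_vs_load_sketch:
        # %rdi is assumed to point at 16 readable bytes.
        vbroadcasti128  (%rdi), %ymm0   # writes all 256 bits of ymm0: the 128-bit
                                        # source is copied into both lanes, so the
                                        # upper half is dirtied
        vmovdqu         (%rdi), %xmm1   # writes xmm1 and zeroes ymm1[255:128]
                                        # (VEX-encoded 128-bit ops clear the upper
                                        # bits of the full-width register)
        vzeroupper                      # clean up before returning to code that
                                        # may use legacy SSE
        ret
```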

Luckily, in this case, the result of the misunderstanding was just a performance issue, not a correctness issue.

briansmith merged commit e98f47f into main on Mar 11, 2025
174 checks passed
briansmith deleted the b/xmm branch on March 11, 2025 at 17:18
briansmith (Owner, Author)

Oh yeah, regarding whether it is worthwhile to remove the vzeroupper: I benchmarked with and without it, and the benchmarks showed no difference, so I left it in to match upstream.

briansmith added this to the 0.17.14 milestone on Mar 11, 2025