@pjbgf pjbgf commented Aug 19, 2025

Introduces a SIMD implementation for the arm64 architecture that relies largely on the ARM64 SHA1 instructions and aligns with the upstream Go implementation. Extensive comments were added to make the code easier to maintain.

The previous non-SIMD implementation was removed, as the SHA1 feature is fairly well supported across modern ARM64 hardware. Keeping both versions in place is not worth the additional complexity.

The check for the SHA1 feature is based on github.com/klauspost/cpuid/v2. This is a temporary dependency, which will be removed once Go exports the same functionality.
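To illustrate the feature check described above, here is a minimal sketch of selecting a backend at startup. The `impl` type, the constant names, and `selectImpl` are hypothetical and do not come from this PR. The probe is passed in as a plain bool so the sketch has no external dependencies; in the real code it would presumably be something like `cpuid.CPU.Has(cpuid.SHA1)` from github.com/klauspost/cpuid/v2, but the exact call is not shown in this conversation. Falling back to the generic implementation when the feature is absent is also an assumption.

```go
package main

import "fmt"

// impl names a block-processing backend. Hypothetical type, for illustration only.
type impl string

const (
	implGeneric   impl = "generic"    // portable Go fallback
	implArm64SHA1 impl = "arm64-sha1" // SIMD path that needs the SHA1 instructions
)

// selectImpl picks a backend from the target architecture and the result of a
// CPU feature probe. hasSHA1 stands in for the real probe (assumed to come from
// github.com/klauspost/cpuid/v2 in the PR).
func selectImpl(goarch string, hasSHA1 bool) impl {
	if goarch == "arm64" && hasSHA1 {
		return implArm64SHA1
	}
	// Assumption: without the SHA1 feature the generic code is used.
	return implGeneric
}

func main() {
	fmt.Println(selectImpl("arm64", true))
	fmt.Println(selectImpl("arm64", false))
	fmt.Println(selectImpl("amd64", true))
}
```

Doing the probe once and caching the result keeps the check off the hot path.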

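The new arm64 native implementation takes 30-35% less time per operation on the Hash* benchmarks (19% less on HashWithCollision), which corresponds to 43-55% higher throughput (23% on HashWithCollision). It now outperforms the cgo alternative in every hashing benchmark: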

pkg: github.com/pjbgf/sha1cd/test
                                   │     /tmp/before     │                 /tmp/after                 │
                                   │       sec/op        │       sec/op         vs base                │
CalculateDvMask/generic-2            0.0000005500n ± 64%   0.0000009000n ± 22%  +63.64% (p=0.003 n=10)
CalculateDvMask/native-2             0.0000008500n ± 18%   0.0000009000n ± 22%        ~ (p=1.000 n=10)
CalculateDvMask/cgo-2                 0.000001950n ± 28%    0.000002200n ± 73%        ~ (p=0.170 n=10)
Hash8Bytes/sha1-2                           105.8n ±  5%          105.2n ±  0%   -0.57% (p=0.002 n=10)
Hash8Bytes/sha1cd_native-2                  426.4n ±  0%          298.0n ±  0%  -30.12% (p=0.000 n=10)
Hash8Bytes/sha1cd_generic-2                 539.0n ±  1%          538.5n ±  1%        ~ (p=0.171 n=10)
Hash8Bytes/sha1cd_cgo-2                     1.671µ ±  3%          1.677µ ±  8%        ~ (p=0.955 n=10)
Hash320Bytes/sha1-2                         317.4n ±  1%          318.0n ±  0%        ~ (p=0.323 n=10)
Hash320Bytes/sha1cd_native-2                2.257µ ±  0%          1.483µ ±  1%  -34.29% (p=0.000 n=10)
Hash320Bytes/sha1cd_generic-2               2.938µ ±  1%          2.943µ ±  1%        ~ (p=0.238 n=10)
Hash320Bytes/sha1cd_cgo-2                   3.103µ ±  2%          3.038µ ±  2%   -2.10% (p=0.009 n=10)
Hash1K/sha1-2                               786.1n ±  1%          785.8n ±  1%        ~ (p=1.000 n=10)
Hash1K/sha1cd_native-2                      6.255µ ±  1%          4.033µ ±  0%  -35.53% (p=0.000 n=10)
Hash1K/sha1cd_generic-2                     8.194µ ±  1%          8.218µ ±  0%        ~ (p=0.119 n=10)
Hash1K/sha1cd_cgo-2                         5.856µ ±  1%          5.809µ ±  2%        ~ (p=0.089 n=10)
Hash8K/sha1-2                               5.744µ ±  1%          5.764µ ±  1%        ~ (p=0.739 n=10)
Hash8K/sha1cd_native-2                      48.23µ ±  1%          31.41µ ±  1%  -34.88% (p=0.000 n=10)
Hash8K/sha1cd_generic-2                     63.07µ ±  1%          63.05µ ±  0%        ~ (p=0.481 n=10)
Hash8K/sha1cd_cgo-2                         34.55µ ±  2%          33.83µ ±  3%        ~ (p=0.105 n=10)
HashWithCollision/sha1cd_native-2           7.518µ ±  1%          6.115µ ±  0%  -18.66% (p=0.000 n=10)
HashWithCollision/sha1cd_generic-2          8.793µ ±  1%          8.782µ ±  0%        ~ (p=0.171 n=10)
HashWithCollision/sha1cd_cgo-2              6.591µ ±  1%          6.467µ ±  1%   -1.87% (p=0.004 n=10)
geomean                                     171.9n                162.4n         -5.51%

                                   │ /tmp/before  │              /tmp/after              │
                                   │     B/s      │      B/s       vs base                │
Hash8Bytes/sha1-2                    72.13Mi ± 5%    72.54Mi ± 0%   +0.56% (p=0.001 n=10)
Hash8Bytes/sha1cd_native-2           17.89Mi ± 0%    25.60Mi ± 0%  +43.10% (p=0.000 n=10)
Hash8Bytes/sha1cd_generic-2          14.15Mi ± 1%    14.17Mi ± 1%        ~ (p=0.202 n=10)
Hash8Bytes/sha1cd_cgo-2              4.568Mi ± 3%    4.554Mi ± 8%        ~ (p=0.926 n=10)
Hash320Bytes/sha1-2                  961.3Mi ± 1%    959.6Mi ± 0%        ~ (p=0.247 n=10)
Hash320Bytes/sha1cd_native-2         135.2Mi ± 0%    205.8Mi ± 1%  +52.23% (p=0.000 n=10)
Hash320Bytes/sha1cd_generic-2        103.8Mi ± 1%    103.7Mi ± 1%        ~ (p=0.255 n=10)
Hash320Bytes/sha1cd_cgo-2            98.35Mi ± 2%   100.46Mi ± 2%   +2.15% (p=0.010 n=10)
Hash1K/sha1-2                        1.213Gi ± 1%    1.214Gi ± 1%        ~ (p=1.000 n=10)
Hash1K/sha1cd_native-2               156.1Mi ± 1%    242.2Mi ± 0%  +55.12% (p=0.000 n=10)
Hash1K/sha1cd_generic-2              119.2Mi ± 1%    118.8Mi ± 0%        ~ (p=0.117 n=10)
Hash1K/sha1cd_cgo-2                  166.8Mi ± 1%    168.1Mi ± 2%        ~ (p=0.089 n=10)
Hash8K/sha1-2                        1.328Gi ± 1%    1.324Gi ± 1%        ~ (p=0.739 n=10)
Hash8K/sha1cd_native-2               162.0Mi ± 1%    248.8Mi ± 1%  +53.56% (p=0.000 n=10)
Hash8K/sha1cd_generic-2              123.9Mi ± 1%    123.9Mi ± 0%        ~ (p=0.481 n=10)
Hash8K/sha1cd_cgo-2                  226.1Mi ± 2%    230.9Mi ± 3%        ~ (p=0.109 n=10)
HashWithCollision/sha1cd_native-2    81.19Mi ± 1%    99.83Mi ± 0%  +22.96% (p=0.000 n=10)
HashWithCollision/sha1cd_generic-2   69.42Mi ± 1%    69.50Mi ± 0%        ~ (p=0.196 n=10)
HashWithCollision/sha1cd_cgo-2       92.60Mi ± 1%    94.38Mi ± 1%   +1.92% (p=0.004 n=10)
geomean                              114.8Mi         127.0Mi       +10.62%
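The sec/op and B/s tables describe the same measurements, which is why a ~30% drop in time shows up as a ~43% rise in throughput rather than ~30%. A small sketch of the arithmetic, using the Hash8Bytes/sha1cd_native-2 figures from the tables above (the function names are illustrative only):

```go
package main

import "fmt"

// timeReduction returns the fractional drop in time per operation,
// e.g. 0.30 for a 30% reduction.
func timeReduction(before, after float64) float64 {
	return (before - after) / before
}

// throughputGain returns the fractional throughput increase implied by the
// same two timings. Throughput is inversely proportional to time per op.
func throughputGain(before, after float64) float64 {
	return before/after - 1
}

func main() {
	before, after := 426.4, 298.0 // ns/op, Hash8Bytes/sha1cd_native-2
	fmt.Printf("time: -%.1f%%, throughput: +%.1f%%\n",
		100*timeReduction(before, after),
		100*throughputGain(before, after))
}
```

Small differences from the benchstat percentages are expected, since benchstat works from the unrounded medians.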

pjbgf added 3 commits August 19, 2025 13:45
The native implementations of the DV mask calculation were missing
the noescape directives. For further optimisation, the wrapping funcs
are now marked with nosplit.

Signed-off-by: Paulo Gomes <[email protected]>
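For readers unfamiliar with the directives mentioned in the commit above, the sketch below shows the shape of a `//go:nosplit` wrapper. All names are hypothetical, and `dvMaskImpl` is a plain Go stand-in that simply XORs the words; it is not the real DV mask calculation. In the actual code the inner function would be a body-less declaration implemented in assembly, which is the only place `//go:noescape` is allowed, so that directive appears here only in a comment.

```go
package main

import "fmt"

// dvMaskImpl is a stand-in for the assembly routine. In an assembly-backed
// package this would be declared without a body, preceded by:
//
//	//go:noescape
//
// which tells the compiler that the pointer argument does not escape.
// The XOR below is a placeholder, not the real algorithm.
func dvMaskImpl(w *[80]uint32) uint32 {
	var m uint32
	for _, v := range w {
		m ^= v
	}
	return m
}

// calculateDvMask is the thin wrapper. go:nosplit omits the stack-growth
// check in the function prologue, which is safe only for small frames.
//
//go:nosplit
func calculateDvMask(w *[80]uint32) uint32 {
	return dvMaskImpl(w)
}

func main() {
	var w [80]uint32
	w[0], w[79] = 1, 2
	fmt.Println(calculateDvMask(&w))
}
```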
Introduces a SIMD implementation for the arm64 architecture that relies largely
on the ARM64 SHA1 instructions and aligns with the upstream Go implementation.
Extensive comments were added to make the code easier to maintain.

The previous non-SIMD implementation was removed, as the SHA1 feature is fairly
well supported across modern ARM64 hardware. Keeping both versions in place
is not worth the additional complexity.

The check for the SHA1 feature is based on github.com/klauspost/cpuid/v2. This is a
temporary dependency, which will be removed once Go exports the same functionality.

Signed-off-by: Paulo Gomes <[email protected]>
@pjbgf pjbgf merged commit f052d33 into main Sep 4, 2025
13 checks passed
@pjbgf pjbgf deleted the arm branch September 4, 2025 08:44