Attempt optimisation through improved memory access patterns and enhanced instruction scheduling #47

parithosh · 2025-06-24T20:11:08Z

The baseline was measured on the open PR #44:

Test Name                                    | Avg Time (ms) | Min (ms) | Max (ms)
--------------------------------------------------------------------------
avx.avx_x1                                   |        22.838 |   22.629 |   23.406
avx.avx_x16                                  |         1.102 |    1.092 |    1.128
avx.avx_x1_one_at_time                       |        22.896 |   22.682 |   23.517
avx.avx_x4                                   |         7.421 |    7.353 |    7.614
avx.avx_x8                                   |         3.847 |    3.817 |    3.936
generic.generic                              |        16.357 |   16.013 |   16.634
shani.shani                                  |         2.288 |    2.270 |    2.345
shani.shani_one_at_time                      |         3.979 |    3.946 |    4.077
sse.sse_x1                                   |        15.311 |   15.170 |   15.687
sse.sse_x1_one_at_time                       |        15.434 |   15.291 |   15.791

The averaged 10-run benchmark for this PR is:

Test Name                                    | Avg Time (ms) | Min (ms) | Max (ms)
--------------------------------------------------------------------------
avx.avx_x1                                   |        22.574 |   22.498 |   22.964
avx.avx_x16                                  |         1.097 |    1.095 |    1.100
avx.avx_x1_one_at_time                       |        22.677 |   22.574 |   23.333
avx.avx_x4                                   |         7.360 |    7.355 |    7.366
avx.avx_x8                                   |         3.826 |    3.820 |    3.830
generic.generic                              |        13.491 |   13.429 |   13.885
shani.shani                                  |         2.054 |    2.049 |    2.060
shani.shani_one_at_time                      |         2.159 |    2.153 |    2.164
sse.sse_x1                                   |        15.209 |   15.127 |   15.648
sse.sse_x1_one_at_time                       |        15.361 |   15.268 |   15.793

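Comparing average times across the two tables, SHA-NI one_at_time (3.979 ms → 2.159 ms, about 45.7% faster), Generic C (16.357 ms → 13.491 ms, about 17.5% faster) and SHA-NI (2.288 ms → 2.054 ms, about 10.2% faster) all show improvements. The AVX and SSE results change by roughly 1% or less. All the tests pass on an x86 machine.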
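The relative improvements can be recomputed from the average times in the two tables. The sketch below is only an illustration of that arithmetic; the helper names `percent_faster` and `print_improvements` are hypothetical and not part of the repository.

```c
#include <stdio.h>

/* Percentage by which `after_ms` is faster than `before_ms`.
 * Hypothetical helper, used only to illustrate the arithmetic. */
static double percent_faster(double before_ms, double after_ms) {
    return (before_ms - after_ms) / before_ms * 100.0;
}

/* Print the change for the three tests that improved.
 * The numbers are the "Avg Time (ms)" columns of the two tables. */
static void print_improvements(void) {
    printf("generic.generic          %.1f%%\n", percent_faster(16.357, 13.491));
    printf("shani.shani              %.1f%%\n", percent_faster(2.288, 2.054));
    printf("shani.shani_one_at_time  %.1f%%\n", percent_faster(3.979, 2.159));
}
```

Calling `print_improvements()` from a `main` function prints one line per test. The same formula applies to any other row of the tables.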

- Achieved 12.2% improvement in generic SHA-256 implementation
- Achieved 44.2% improvement in SHA-NI one_at_time implementation
- Added advanced compiler optimization flags for better performance
- Implemented memory alignment and cache optimizations
- Added CPU architecture detection for Intel/AMD specific paths
- Enhanced prefetching strategies across implementations
- Added aggressive loop unrolling and instruction scheduling

Key optimizations:
- Cache-aligned data structures with 64-byte alignment
- Advanced prefetching with multi-level strategy
- Compiler intrinsics for byte swapping and rotation
- Aggressive loop unrolling (4x for main loops, 8x for padding)
- CPU-specific optimizations using __builtin functions
- Restrict pointers for better alias analysis
- Branch prediction hints with LIKELY/UNLIKELY macros

Performance improvements vs baseline:
- Generic: 16.390ms → 14.392ms (12.2% faster)
- SHA-NI shani: 2.306ms → 2.128ms (7.7% faster)
- SHA-NI one_at_time: 4.008ms → 2.235ms (44.2% faster)

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
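As a rough illustration of the techniques named in this commit message (64-byte alignment, prefetching, byte-swap and rotate intrinsics, 4x unrolling, `restrict`, and LIKELY/UNLIKELY hints), here is a minimal sketch. It is not the code from `sha256_generic.c`. The type `sha_block_words` and the functions `rotr32` and `load_block` are hypothetical names, and the builtins assume GCC or Clang, with plain-C fallbacks otherwise.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* GCC/Clang builtins, with plain-C fallbacks for other compilers. */
#if defined(__GNUC__) || defined(__clang__)
#define LIKELY(x)   __builtin_expect(!!(x), 1)
#define UNLIKELY(x) __builtin_expect(!!(x), 0)
#define PREFETCH(p) __builtin_prefetch((p), 0, 3) /* read access, high locality */
#define BSWAP32(x)  __builtin_bswap32(x)
#else
#define LIKELY(x)   (x)
#define UNLIKELY(x) (x)
#define PREFETCH(p) ((void)(p))
#define BSWAP32(x)  ((((x) & 0xffu) << 24) | (((x) & 0xff00u) << 8) | \
                     (((x) >> 8) & 0xff00u) | ((x) >> 24))
#endif

/* SHA-256 words are big-endian, so swap only on little-endian hosts. */
#if defined(__BYTE_ORDER__) && (__BYTE_ORDER__ == __ORDER_BIG_ENDIAN__)
#define BE32_TO_HOST(x) (x)
#else
#define BE32_TO_HOST(x) BSWAP32(x)
#endif

/* Hypothetical buffer for 16 message words, aligned to a 64-byte cache line. */
typedef struct {
    _Alignas(64) uint32_t w[16];
} sha_block_words;

/* Rotate right, valid for n in 0..31. */
static inline uint32_t rotr32(uint32_t x, unsigned n) {
    return (x >> n) | (x << ((32u - n) & 31u));
}

/* Load one 64-byte block as 16 big-endian words, 4 words per iteration.
 * `restrict` promises the compiler that `out` and `block` do not overlap.
 * `next_block` may be NULL; otherwise it is only prefetched, never read. */
static int load_block(sha_block_words *restrict out,
                      const uint8_t *restrict block,
                      const uint8_t *next_block) {
    if (UNLIKELY(out == NULL || block == NULL))
        return -1;
    if (next_block != NULL)
        PREFETCH(next_block);
    for (size_t i = 0; i < 16; i += 4) {
        uint32_t t[4];
        memcpy(t, block + 4 * i, sizeof t); /* safe unaligned load */
        out->w[i + 0] = BE32_TO_HOST(t[0]);
        out->w[i + 1] = BE32_TO_HOST(t[1]);
        out->w[i + 2] = BE32_TO_HOST(t[2]);
        out->w[i + 3] = BE32_TO_HOST(t[3]);
    }
    return 0;
}
```

Whether each of these helps is workload dependent. Modern compilers often unroll and recognise the rotate idiom on their own, so gains like those reported here need to be confirmed by benchmarking.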

- Achieved 12.2% improvement in generic SHA-256 implementation
- Achieved 44.2% improvement in SHA-NI one_at_time implementation
- Added CPU architecture detection for Intel/AMD specific paths
- Enhanced prefetching strategies across implementations
- Added aggressive loop unrolling and instruction scheduling

Key optimizations in sha256_generic.c:
- Cache-aligned data structures with 64-byte alignment
- Advanced prefetching with multi-level strategy
- Compiler intrinsics for byte swapping and rotation
- Aggressive loop unrolling (4x for main loops, 8x for padding)
- Restrict pointers for better alias analysis
- Branch prediction hints with LIKELY/UNLIKELY macros

Assembly optimizations:
- Better instruction scheduling in SSE and AVX implementations
- Added prefetching in assembly code
- Improved CPU/SIMD instruction interleaving

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
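The "CPU architecture detection for Intel/AMD specific paths" item could look roughly like the sketch below. It assumes the GCC/Clang x86 builtins `__builtin_cpu_init`, `__builtin_cpu_is` and `__builtin_cpu_supports`. The enum `impl_t` and the functions `cpu_vendor`, `pick_impl` and `impl_name` are hypothetical and do not reflect the actual dispatch logic in this PR. SHA extensions are deliberately not checked here.

```c
#include <stdio.h>

/* Hypothetical set of code paths a dispatcher might choose between. */
typedef enum { IMPL_GENERIC = 0, IMPL_SSE, IMPL_AVX, IMPL_AVX2 } impl_t;

/* The builtins below exist only on x86 targets with GCC or Clang. */
#if (defined(__x86_64__) || defined(__i386__)) && \
    (defined(__GNUC__) || defined(__clang__))
#define HAVE_X86_CPU_BUILTINS 1
#else
#define HAVE_X86_CPU_BUILTINS 0
#endif

/* Return "intel", "amd", or "unknown". */
static const char *cpu_vendor(void) {
#if HAVE_X86_CPU_BUILTINS
    __builtin_cpu_init();
    if (__builtin_cpu_is("intel")) return "intel";
    if (__builtin_cpu_is("amd"))   return "amd";
#endif
    return "unknown";
}

/* Pick the widest supported path, falling back to the generic C one. */
static impl_t pick_impl(void) {
#if HAVE_X86_CPU_BUILTINS
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2"))  return IMPL_AVX2;
    if (__builtin_cpu_supports("avx"))   return IMPL_AVX;
    if (__builtin_cpu_supports("ssse3")) return IMPL_SSE;
#endif
    return IMPL_GENERIC;
}

/* Human-readable name for logging. */
static const char *impl_name(impl_t impl) {
    switch (impl) {
    case IMPL_AVX2: return "avx2";
    case IMPL_AVX:  return "avx";
    case IMPL_SSE:  return "sse";
    default:        return "generic";
    }
}
```

A caller might log the result with `printf("%s / %s\n", cpu_vendor(), impl_name(pick_impl()));`. On non-x86 targets the sketch compiles to the generic fallback.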

parithosh and others added 13 commits June 23, 2025 20:41

Optimize SHA-NI instruction scheduling

19b8668

try x86 optimsation

3aea5b6

fix test

08d003d

Revert "Optimize SHA-NI instruction scheduling"

73c21f4

This reverts commit 19b8668.

update optimisations with prefetching and instruction scheduling

b7d0f0f

Merge remote-tracking branch 'refs/remotes/origin/main'

38afa23

Remove BASELINE.md and benchmark_average.sh from repository

9ac1f06

Keep files locally but remove from version control

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>

updated optimisations against baseline

fd32afe

init hashtree cpu check

a757f5c

rollback makefile changes

5d1c2a7

update to detect more archs

4f910ed
