parithosh

Baseline was taken from the open PR: #44

Test Name                                    | Avg Time (ms) | Min (ms) | Max (ms)
--------------------------------------------------------------------------
avx.avx_x1                                   |        22.838 |   22.629 |   23.406
avx.avx_x16                                  |         1.102 |    1.092 |    1.128
avx.avx_x1_one_at_time                       |        22.896 |   22.682 |   23.517
avx.avx_x4                                   |         7.421 |    7.353 |    7.614
avx.avx_x8                                   |         3.847 |    3.817 |    3.936
generic.generic                              |        16.357 |   16.013 |   16.634
shani.shani                                  |         2.288 |    2.270 |    2.345
shani.shani_one_at_time                      |         3.979 |    3.946 |    4.077
sse.sse_x1                                   |        15.311 |   15.170 |   15.687
sse.sse_x1_one_at_time                       |        15.434 |   15.291 |   15.791

The averaged 10-run benchmark for this PR is:

Test Name                                    | Avg Time (ms) | Min (ms) | Max (ms)
--------------------------------------------------------------------------
avx.avx_x1                                   |        22.574 |   22.498 |   22.964
avx.avx_x16                                  |         1.097 |    1.095 |    1.100
avx.avx_x1_one_at_time                       |        22.677 |   22.574 |   23.333
avx.avx_x4                                   |         7.360 |    7.355 |    7.366
avx.avx_x8                                   |         3.826 |    3.820 |    3.830
generic.generic                              |        13.491 |   13.429 |   13.885
shani.shani                                  |         2.054 |    2.049 |    2.060
shani.shani_one_at_time                      |         2.159 |    2.153 |    2.164
sse.sse_x1                                   |        15.209 |   15.127 |   15.648
sse.sse_x1_one_at_time                       |        15.361 |   15.268 |   15.793

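Generic C, SHA-NI, and SHA-NI one_at_time all improve against the baseline: generic drops from 16.357 ms to 13.491 ms (about 17.5% faster), shani from 2.288 ms to 2.054 ms (about 10.2% faster), and shani_one_at_time from 3.979 ms to 2.159 ms (about 45.7% faster). The AVX and SSE results are essentially unchanged (within about 1.2% of the baseline). All the tests pass on an x86 machine.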

parithosh and others added 13 commits June 23, 2025 20:41
- Achieved 12.2% improvement in generic SHA-256 implementation
- Achieved 44.2% improvement in SHA-NI one_at_time implementation
- Added advanced compiler optimization flags for better performance
- Implemented memory alignment and cache optimizations
- Added CPU architecture detection for Intel/AMD specific paths
- Enhanced prefetching strategies across implementations
- Added aggressive loop unrolling and instruction scheduling
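
As a rough illustration of the CPU detection item above, the sketch below reads the vendor string and the SHA extensions bit through the `cpuid.h` helpers shipped with GCC and Clang. The names `cpu_info_t`, `detect_cpu`, and `pick_impl` are hypothetical and not taken from this PR; the actual detection code may be structured quite differently.

```c
#include <string.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>   /* GCC/Clang helper header for the CPUID instruction */
#endif

/* Hypothetical result type, for illustration only. */
typedef struct {
    char vendor[13];  /* "GenuineIntel", "AuthenticAMD", ...; empty if unknown */
    int  has_sha_ni;  /* CPUID.(EAX=7,ECX=0):EBX bit 29 */
} cpu_info_t;

static cpu_info_t detect_cpu(void)
{
    cpu_info_t info;
    memset(&info, 0, sizeof info);
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;

    /* Leaf 0: the vendor string is stored in EBX, EDX, ECX (in that order). */
    if (__get_cpuid(0, &eax, &ebx, &ecx, &edx)) {
        memcpy(info.vendor + 0, &ebx, 4);
        memcpy(info.vendor + 4, &edx, 4);
        memcpy(info.vendor + 8, &ecx, 4);
        info.vendor[12] = '\0';
    }

    /* Leaf 7, subleaf 0: structured extended feature flags. */
    if (__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx)) {
        info.has_sha_ni = (int)((ebx >> 29) & 1u);
    }
#endif
    return info;   /* on non-x86 targets everything stays zeroed */
}

/* Hypothetical dispatcher: returns the name of the code path to use. */
static const char *pick_impl(const cpu_info_t *info)
{
    return info->has_sha_ni ? "shani" : "generic";
}
```

A caller would run `detect_cpu()` once at startup and cache the result; vendor-specific tuning could then branch on `info.vendor`.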

Key optimizations:
- Cache-aligned data structures with 64-byte alignment
- Advanced prefetching with multi-level strategy
- Compiler intrinsics for byte swapping and rotation
- Aggressive loop unrolling (4x for main loops, 8x for padding)
- CPU-specific optimizations using __builtin functions
- Restrict pointers for better alias analysis
- Branch prediction hints with LIKELY/UNLIKELY macros
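
The list above names several GCC/Clang-level techniques. A minimal sketch of how they typically look in C follows, assuming a GCC-compatible compiler. The type `sha256_ctx_sketch` and the helper names are hypothetical, chosen for illustration; only the `LIKELY`/`UNLIKELY` macro names come from the commit message, and their definitions here are the conventional ones rather than the PR's confirmed code.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Conventional branch prediction hints built on __builtin_expect. */
#define LIKELY(x)   __builtin_expect(!!(x), 1)
#define UNLIKELY(x) __builtin_expect(!!(x), 0)

/* Hypothetical context, aligned to 64 bytes (a common x86 cache-line size). */
typedef struct {
    uint32_t state[8];
    uint8_t  buf[64];
} __attribute__((aligned(64))) sha256_ctx_sketch;

/* Rotate right; n must be in 1..31. Compilers usually emit a single rotate. */
static inline uint32_t rotr32(uint32_t x, unsigned n)
{
    return (x >> n) | (x << (32u - n));
}

/* Load a big-endian 32-bit word, using the byte-swap intrinsic on
 * little-endian hosts. memcpy avoids unaligned-access problems. */
static inline uint32_t load_be32(const uint8_t *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
    v = __builtin_bswap32(v);
#endif
    return v;
}

/* SHA-256 small sigma0: ROTR7 ^ ROTR18 ^ SHR3 (FIPS 180-4). */
static inline uint32_t small_sigma0(uint32_t x)
{
    return rotr32(x, 7) ^ rotr32(x, 18) ^ (x >> 3);
}

/* Convert nblocks 64-byte blocks into big-endian words (16 per block).
 * restrict promises the compiler that w and data do not overlap. */
static void load_blocks(uint32_t *restrict w, const uint8_t *restrict data,
                        size_t nblocks)
{
    if (UNLIKELY(nblocks == 0))
        return;
    for (size_t i = 0; i < nblocks * 16; i++)
        w[i] = load_be32(data + 4 * i);
}
```

Whether any of these help is workload- and compiler-dependent, which is why the benchmark tables matter more than the list itself.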

Performance improvements vs baseline:
- Generic: 16.390ms → 14.392ms (12.2% faster)
- SHA-NI shani: 2.306ms → 2.128ms (7.7% faster)
- SHA-NI one_at_time: 4.008ms → 2.235ms (44.2% faster)
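
The percentages above appear to follow the usual "time saved relative to baseline" formula. A small sketch, with the hypothetical helper name `percent_faster`, that reproduces them from the quoted timings:

```c
/* Percentage of baseline time saved: (baseline - current) / baseline * 100.
 * Hypothetical helper; timings are in milliseconds. */
static double percent_faster(double baseline_ms, double current_ms)
{
    return (baseline_ms - current_ms) / baseline_ms * 100.0;
}
```

For example, `percent_faster(16.390, 14.392)` is about 12.2, `percent_faster(2.306, 2.128)` about 7.7, and `percent_faster(4.008, 2.235)` about 44.2.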

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
- Achieved 12.2% improvement in generic SHA-256 implementation
- Achieved 44.2% improvement in SHA-NI one_at_time implementation
- Added CPU architecture detection for Intel/AMD specific paths
- Enhanced prefetching strategies across implementations
- Added aggressive loop unrolling and instruction scheduling

Key optimizations in sha256_generic.c:
- Cache-aligned data structures with 64-byte alignment
- Advanced prefetching with multi-level strategy
- Compiler intrinsics for byte swapping and rotation
- Aggressive loop unrolling (4x for main loops, 8x for padding)
- Restrict pointers for better alias analysis
- Branch prediction hints with LIKELY/UNLIKELY macros
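
To make the prefetching and 4x unrolling items concrete, here is a minimal sketch on a simple word byte-swap loop, assuming GCC or Clang builtins. The function name `bswap_words_unrolled` and the 64-byte prefetch distance are assumptions for illustration, not the code in sha256_generic.c.

```c
#include <stddef.h>
#include <stdint.h>

/* Byte-swap n 32-bit words from src into dst, four per iteration.
 * dst and src must not overlap (restrict). */
static void bswap_words_unrolled(uint32_t *restrict dst,
                                 const uint32_t *restrict src, size_t n)
{
    size_t i = 0;

    /* Main loop, unrolled 4x. */
    for (; i + 4 <= n; i += 4) {
        /* Hint the line 16 words (64 bytes) ahead: read access, keep in
         * all cache levels. The bounds check keeps the address in range. */
        if (i + 16 < n)
            __builtin_prefetch(&src[i + 16], 0, 3);

        dst[i + 0] = __builtin_bswap32(src[i + 0]);
        dst[i + 1] = __builtin_bswap32(src[i + 1]);
        dst[i + 2] = __builtin_bswap32(src[i + 2]);
        dst[i + 3] = __builtin_bswap32(src[i + 3]);
    }

    /* Tail: the remaining 0 to 3 words. */
    for (; i < n; i++)
        dst[i] = __builtin_bswap32(src[i]);
}
```

Modern compilers often unroll such loops on their own at `-O3`, and hardware prefetchers handle sequential access well, so manual versions like this are worth keeping only if measurements show a gain.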

Assembly optimizations:
- Better instruction scheduling in SSE and AVX implementations
- Added prefetching in assembly code
- Improved CPU/SIMD instruction interleaving

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
Keep files locally but remove from version control

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>