@PeterPtroc

Description

This PR introduces a hardware-accelerated implementation of CRC32C for RISC-V processors that support the Zbc or Zbkc extensions.

Motivation

CRC32C is a performance-critical operation in many applications. The current software fallback on RISC-V is slower than what can be achieved using the carry-less multiplication instructions available in the Zbc extension. This change leverages these instructions to improve throughput.
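For readers unfamiliar with the instructions mentioned above, the following is a minimal portable sketch of what `clmul` and `clmulh` compute on RV64: the low and high 64 bits of a carry-less (XOR-based) product. The function names `ClmulLow`, `ClmulHigh`, and `Clmul` are illustrative and are not taken from this PR. The inline-assembly branch assumes the compiler defines `__riscv_zbc` or `__riscv_zbkc` when the extension is enabled via `-march`.

```cpp
#include <cstdint>

// Software model of the RV64 `clmul` instruction: low 64 bits of the
// carry-less product of a and b (partial products are XORed, not added).
constexpr std::uint64_t ClmulLow(std::uint64_t a, std::uint64_t b) {
  std::uint64_t r = 0;
  for (int i = 0; i < 64; ++i) {
    if ((b >> i) & 1) r ^= a << i;
  }
  return r;
}

// Software model of `clmulh`: high 64 bits of the same 128-bit product.
constexpr std::uint64_t ClmulHigh(std::uint64_t a, std::uint64_t b) {
  std::uint64_t r = 0;
  for (int i = 1; i < 64; ++i) {
    if ((b >> i) & 1) r ^= a >> (64 - i);
  }
  return r;
}

// (x + 1) * (x + 1) = x^2 + 1 over GF(2): the middle terms cancel.
static_assert(ClmulLow(3, 3) == 5, "carry-less square of 0b11");
// A bit shifted out of the low half shows up in the high half.
static_assert(ClmulLow(1ULL << 63, 2) == 0, "low half overflows");
static_assert(ClmulHigh(1ULL << 63, 2) == 1, "high half receives the bit");

// Hypothetical wrapper: uses the hardware instruction when the translation
// unit is compiled with Zbc/Zbkc enabled, and the model above otherwise.
inline std::uint64_t Clmul(std::uint64_t a, std::uint64_t b) {
#if defined(__riscv) && (defined(__riscv_zbc) || defined(__riscv_zbkc))
  std::uint64_t r;
  __asm__("clmul %0, %1, %2" : "=r"(r) : "r"(a), "r"(b));
  return r;
#else
  return ClmulLow(a, b);
#endif
}
```

The bit-by-bit loops take up to 64 iterations per product, while the hardware performs the same operation in a single instruction, which is what makes a folding-style CRC practical.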

Changes

  • absl/crc/internal/crc_riscv.cc: Implemented AbslCrc32cClmulRiscv using clmul and clmulh instructions via inline assembly. The implementation uses a folding approach similar to the x86/ARM combined implementation.
  • absl/crc/internal/cpu_detect.cc: Added runtime CPU feature detection for RISC-V using riscv_hwprobe on Linux to safely enable the accelerated path only when the hardware supports it.
  • absl/crc/internal/crc.cc: Updated CRCImpl::NewInternal to instantiate the RISC-V implementation when supported hardware is detected.
  • Build System:
    • Updated CMakeLists.txt to detect compiler support for -march=rv64gc_zbc or -march=rv64gc_zbkc and apply it to the specific translation unit.
    • Updated BUILD.bazel to apply -march=rv64gc_zbc for riscv64 builds using GCC/Clang.
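
The runtime detection and dispatch described in the list above could look roughly like the sketch below. It is not the code from this PR: `CpuSupportsClmul`, `CrcBackend`, and `SelectCrcBackend` are hypothetical names, and the sketch assumes Linux kernel headers recent enough to define `RISCV_HWPROBE_EXT_ZBC` and `RISCV_HWPROBE_EXT_ZBKC`. On any other platform it simply reports no support.

```cpp
#include <cstdint>

#if defined(__riscv) && defined(__linux__) && defined(__has_include)
#if __has_include(<asm/hwprobe.h>)
#include <asm/hwprobe.h>
#include <sys/syscall.h>
#include <unistd.h>
#define SKETCH_HAVE_HWPROBE 1
#endif
#endif

// Returns true only if the kernel reports Zbc or Zbkc on all online CPUs.
// Hypothetical helper; the PR's actual detection code may differ.
bool CpuSupportsClmul() {
#if defined(SKETCH_HAVE_HWPROBE) && defined(__NR_riscv_hwprobe) && \
    defined(RISCV_HWPROBE_EXT_ZBC) && defined(RISCV_HWPROBE_EXT_ZBKC)
  riscv_hwprobe pair{};
  pair.key = RISCV_HWPROBE_KEY_IMA_EXT_0;
  // cpusetsize = 0 and cpus = nullptr query all online CPUs.
  if (syscall(__NR_riscv_hwprobe, &pair, 1, 0, nullptr, 0) != 0) return false;
  // The kernel sets the key to -1 when it does not recognize it.
  if (pair.key < 0) return false;
  return (pair.value & (RISCV_HWPROBE_EXT_ZBC | RISCV_HWPROBE_EXT_ZBKC)) != 0;
#else
  return false;  // Not RISC-V Linux, or headers too old: use the fallback.
#endif
}

enum class CrcBackend { kPortable, kRiscvClmul };

// Probes once and caches the result, so callers pay for the syscall only
// on first use.
CrcBackend SelectCrcBackend() {
  static const bool has_clmul = CpuSupportsClmul();
  return has_clmul ? CrcBackend::kRiscvClmul : CrcBackend::kPortable;
}
```

Separating the probe from the selection keeps the accelerated translation unit (built with `-march=rv64gc_zbc`) from being entered on hardware that lacks the instructions.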

Performance

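Benchmarks were run on a 64-bit RISC-V system (64 cores @ 2.6 GHz) with --benchmark_min_time=1s. The times below are taken from the Time column of the raw benchmark output.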

Benchmark: //absl/crc:crc32c_benchmark

| Benchmark | Origin (ns) | Patch (ns) | Speedup |
| --- | ---: | ---: | ---: |
| BM_Calculate/500000 | 892,621 | 724,135 | 1.23x |
| BM_Extend/500000 | 883,467 | 731,185 | 1.21x |
| BM_Extend/100000000 | 177,236,199 | 139,494,419 | 1.27x |
| BM_ExtendCacheMiss/100000 | 268,657,255 | 210,529,600 | 1.28x |

Throughput (MiB/s)

| Benchmark | Origin (MiB/s) | Patch (MiB/s) | Improvement |
| --- | ---: | ---: | ---: |
| BM_ExtendCacheMiss/100 | 294.81 | 476.85 | 1.62x |
| BM_ExtendCacheMiss/1000 | 502.45 | 710.23 | 1.41x |
| BM_ExtendCacheMiss/100000 | 533.85 | 681.29 | 1.28x |
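
The ratio columns above can be reproduced from the raw numbers as follows. This is a small illustrative sketch; `Speedup` and `ThroughputGain` are names chosen here, not part of the benchmark harness.

```cpp
// Time-based ratio: baseline time divided by patched time.
// Values above 1.0 mean the patch is faster.
constexpr double Speedup(double origin_ns, double patch_ns) {
  return origin_ns / patch_ns;
}

// Throughput-based ratio: patched throughput divided by baseline throughput.
constexpr double ThroughputGain(double origin_mib_s, double patch_mib_s) {
  return patch_mib_s / origin_mib_s;
}

// BM_Calculate/500000: 892,621 ns vs 724,135 ns, about 1.23x.
static_assert(Speedup(892621.0, 724135.0) > 1.225 &&
                  Speedup(892621.0, 724135.0) < 1.235,
              "matches the 1.23x reported in the table");
```

Note that the two ratios are inverted relative to each other: lower is better for time, higher is better for throughput.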

Testing

Ran //absl/crc:all tests on the target hardware.

//absl/crc:crc32c_test                                                   PASSED
//absl/crc:crc_cord_state_test                                           PASSED
//absl/crc:non_temporal_memcpy_test                                      PASSED
//absl/crc:crc_memcpy_test                                               PASSED

All 231 tests in the project passed.
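
As an aside on how an accelerated path can be cross-checked, here is a minimal bit-at-a-time CRC32C reference that could serve as an oracle. It is a sketch and not part of Abseil's test suite; `Crc32cReference` is a hypothetical name. It uses the reflected Castagnoli polynomial 0x82F63B78 and the standard check string "123456789", whose CRC32C is 0xE3069283.

```cpp
#include <cstddef>
#include <cstdint>

// Bitwise CRC32C (Castagnoli), reflected form. Slow but simple, which makes
// it a convenient reference for validating a faster implementation.
// Passing a previous result as `crc` extends that checksum with more data.
std::uint32_t Crc32cReference(const void* data, std::size_t len,
                              std::uint32_t crc = 0) {
  const auto* p = static_cast<const unsigned char*>(data);
  crc = ~crc;  // Undo the final inversion of the previous call (or seed).
  for (std::size_t i = 0; i < len; ++i) {
    crc ^= p[i];
    for (int bit = 0; bit < 8; ++bit) {
      // Shift right; XOR in the polynomial if the bit shifted out was 1.
      crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
  }
  return ~crc;
}
```

Comparing an accelerated implementation against such a reference over random lengths and alignments is one way to catch errors in folding constants or tail handling.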

Raw Benchmark Data (Origin)
[* abseil-cpp]$ bazel run //absl/crc:crc32c_benchmark -c opt --enable_bzlmod  --benchmark_min_time=1s
INFO: Analyzed target //absl/crc:crc32c_benchmark (1 packages loaded, 24 targets configured).
INFO: Found 1 target...
Target //absl/crc:crc32c_benchmark up-to-date:
  bazel-bin/absl/crc/crc32c_benchmark
INFO: Elapsed time: 4.602s, Critical Path: 3.86s
INFO: 9 processes: 7 action cache hit, 2 internal, 7 linux-sandbox.
INFO: Build completed successfully, 9 total actions
INFO: Running command line: bazel-bin/absl/crc/crc32c_benchmark <args omitted>
*
Running *.cache/bazel/_bazel_*/77577b7b5e8938aa3fe7898c32bad6a7/execroot/_main/bazel-out/riscv64-opt/bin/absl/crc/crc32c_benchmark
Run on (64 X 2600 MHz CPU s)
CPU Caches:
  L1 Data 64 KiB (x64)
  L1 Instruction 64 KiB (x64)
  L2 Unified 2048 KiB (x16)
  L3 Unified 65536 KiB (x1)
Load Average: 1.10, 2.26, 5.42
------------------------------------------------------------------------
Benchmark                              Time             CPU   Iterations
------------------------------------------------------------------------
BM_Calculate/0                      23.6 ns         23.6 ns     59454613
BM_Calculate/1                      22.9 ns         22.9 ns     61361837
BM_Calculate/100                     249 ns          248 ns      5638146
BM_Calculate/2048                   3695 ns         3679 ns       380494
BM_Calculate/10000                 17662 ns        17625 ns        78892
BM_Calculate/500000               892621 ns       884865 ns         1588
BM_Extend/0                         24.7 ns         24.2 ns     57844026
BM_Extend/1                         22.2 ns         22.2 ns     62762493
BM_Extend/100                        247 ns          246 ns      5720811
BM_Extend/2048                      3672 ns         3668 ns       381549
BM_Extend/10000                    17629 ns        17612 ns        79500
BM_Extend/500000                  883467 ns       881670 ns         1588
BM_Extend/100000000            177236199 ns    176880740 ns            8
BM_ExtendCacheMiss/10          858913660 ns    856805320 ns            2 bytes_per_second=166.959Mi/s
BM_ExtendCacheMiss/100         486452897 ns    485224927 ns            3 bytes_per_second=294.814Mi/s
BM_ExtendCacheMiss/1000        285993958 ns    284708412 ns            5 bytes_per_second=502.448Mi/s
BM_ExtendCacheMiss/100000      268657255 ns    267959252 ns            5 bytes_per_second=533.854Mi/s
BM_ExtendByZeroes/1                 48.1 ns         48.0 ns     29188160
BM_ExtendByZeroes/10                48.0 ns         48.0 ns     29187614
BM_ExtendByZeroes/100               80.3 ns         80.0 ns     17517422
BM_ExtendByZeroes/1000               111 ns          111 ns     12584720
BM_ExtendByZeroes/10000              112 ns          112 ns     12506179
BM_ExtendByZeroes/100000             143 ns          143 ns      9790263
BM_ExtendByZeroes/1000000            143 ns          143 ns      9783364
BM_ExtendByZeroes/1                 48.1 ns         48.0 ns     29184642
BM_ExtendByZeroes/32                48.6 ns         48.4 ns     28922756
BM_ExtendByZeroes/1024              49.5 ns         49.3 ns     28494599
BM_ExtendByZeroes/32768             48.3 ns         48.2 ns     29064528
BM_ExtendByZeroes/1048576           52.7 ns         52.7 ns     26589859
BM_UnextendByZeroes/1               72.2 ns         72.1 ns     19423861
BM_UnextendByZeroes/10              72.2 ns         72.1 ns     19424942
BM_UnextendByZeroes/100              104 ns          104 ns     13506017
BM_UnextendByZeroes/1000             134 ns          133 ns     10440532
BM_UnextendByZeroes/10000            136 ns          136 ns     10318261
BM_UnextendByZeroes/100000           166 ns          166 ns      8425092
BM_UnextendByZeroes/1000000          166 ns          166 ns      8424875
BM_UnextendByZeroes/1               72.3 ns         72.1 ns     19426537
BM_UnextendByZeroes/32              71.4 ns         71.2 ns     19662336
BM_UnextendByZeroes/1024            72.9 ns         72.8 ns     19210142
BM_UnextendByZeroes/32768           72.8 ns         72.7 ns     19252956
BM_UnextendByZeroes/1048576         74.1 ns         74.0 ns     18928105
BM_Concat/1                         47.9 ns         47.9 ns     29371736
BM_Concat/10                        48.1 ns         48.1 ns     29170655
BM_Concat/100                       80.6 ns         80.5 ns     17385097
BM_Concat/1000                       112 ns          112 ns     12477845
BM_Concat/10000                      112 ns          112 ns     12474567
BM_Concat/100000                     144 ns          144 ns      9730563
BM_Concat/1000000                    144 ns          144 ns      9728900
BM_Concat/1                         48.3 ns         48.3 ns     29169483
BM_Concat/32                        49.7 ns         49.6 ns     28212843
BM_Concat/1024                      49.7 ns         49.6 ns     28178132
BM_Concat/32768                     48.9 ns         48.9 ns     28657798
BM_Concat/1048576                   52.0 ns         51.9 ns     26968131
BM_Memcpy/0                         22.7 ns         22.7 ns     62043207 bytes_per_second=0/s
BM_Memcpy/1                         56.5 ns         56.4 ns     24726198 bytes_per_second=16.8945Mi/s
BM_Memcpy/100                        296 ns          296 ns      4731903 bytes_per_second=322.272Mi/s
BM_Memcpy/2048                      3945 ns         3943 ns       345292 bytes_per_second=495.36Mi/s
BM_Memcpy/10000                    20624 ns        20612 ns        66949 bytes_per_second=462.685Mi/s
BM_Memcpy/500000                 1058785 ns      1056609 ns         1324 bytes_per_second=451.29Mi/s
BM_RemoveSuffix/1/1                 70.6 ns         70.5 ns     19850633
BM_RemoveSuffix/100/10              70.6 ns         70.5 ns     19851258
BM_RemoveSuffix/100/100              103 ns          103 ns     13630443
BM_RemoveSuffix/10000/1             70.6 ns         70.5 ns     19850547
BM_RemoveSuffix/10000/100            103 ns          103 ns     13629602
BM_RemoveSuffix/10000/10000          136 ns          135 ns     10338381
BM_RemoveSuffix/500000/1            70.6 ns         70.5 ns     19850918
BM_RemoveSuffix/500000/100           103 ns          103 ns     13629029
BM_RemoveSuffix/500000/10000         136 ns          135 ns     10338541
BM_RemoveSuffix/500000/500000        167 ns          167 ns      8378608
Raw Benchmark Data (Patch)
[* abseil-cpp]$ bazel run //absl/crc:crc32c_benchmark -c opt --enable_bzlmod --benchmark_min_time=1s
INFO: Analyzed target //absl/crc:crc32c_benchmark (86 packages loaded, 790 targets configured).
INFO: Found 1 target...
Target //absl/crc:crc32c_benchmark up-to-date:
  bazel-bin/absl/crc/crc32c_benchmark
INFO: Elapsed time: 15.247s, Critical Path: 12.68s
INFO: 79 processes: 3 action cache hit, 2 internal, 77 linux-sandbox.
INFO: Build completed successfully, 79 total actions
INFO: Running command line: bazel-bin/absl/crc/crc32c_benchmark <args omitted>
2025-12-24T23:42:12+08:00
Running *.cache/bazel/_bazel_*/77577b7b5e8938aa3fe7898c32bad6a7/execroot/_main/bazel-out/riscv64-opt/bin/absl/crc/crc32c_benchmark
Run on (64 X 2600 MHz CPU s)
CPU Caches:
  L1 Data 64 KiB (x64)
  L1 Instruction 64 KiB (x64)
  L2 Unified 2048 KiB (x16)
  L3 Unified 65536 KiB (x1)
Load Average: 5.76, 3.73, 6.79
------------------------------------------------------------------------
Benchmark                              Time             CPU   Iterations
------------------------------------------------------------------------
BM_Calculate/0                      24.6 ns         24.5 ns     57072395
BM_Calculate/1                      21.6 ns         21.6 ns     64860229
BM_Calculate/100                     141 ns          141 ns      9985823
BM_Calculate/2048                   2550 ns         2547 ns       549496
BM_Calculate/10000                 12490 ns        12475 ns       112234
BM_Calculate/500000               724135 ns       723005 ns         1908
BM_Extend/0                         25.3 ns         24.8 ns     56430265
BM_Extend/1                         22.6 ns         22.3 ns     62151281
BM_Extend/100                        142 ns          141 ns     10068021
BM_Extend/2048                      2551 ns         2548 ns       549815
BM_Extend/10000                    12501 ns        12482 ns       112073
BM_Extend/500000                  731185 ns       730307 ns         1924
BM_Extend/100000000            139494419 ns    139121212 ns           10
BM_ExtendCacheMiss/10          845394611 ns    842956760 ns            2 bytes_per_second=169.702Mi/s
BM_ExtendCacheMiss/100         301375723 ns    299989360 ns            5 bytes_per_second=476.854Mi/s
BM_ExtendCacheMiss/1000        201975618 ns    201415640 ns            7 bytes_per_second=710.229Mi/s
BM_ExtendCacheMiss/100000      210529600 ns    209970426 ns            7 bytes_per_second=681.292Mi/s
BM_ExtendByZeroes/1                 48.1 ns         48.0 ns     29189021
BM_ExtendByZeroes/10                48.0 ns         48.0 ns     29187804
BM_ExtendByZeroes/100               80.0 ns         79.9 ns     17506395
BM_ExtendByZeroes/1000               111 ns          111 ns     12578695
BM_ExtendByZeroes/10000              112 ns          112 ns     12505440
BM_ExtendByZeroes/100000             143 ns          143 ns      9790994
BM_ExtendByZeroes/1000000            143 ns          143 ns      9795305
BM_ExtendByZeroes/1                 48.0 ns         48.0 ns     29185533
BM_ExtendByZeroes/32                48.5 ns         48.4 ns     28924471
BM_ExtendByZeroes/1024              49.0 ns         49.0 ns     28743816
BM_ExtendByZeroes/32768             48.2 ns         48.2 ns     29060242
BM_ExtendByZeroes/1048576           52.8 ns         52.7 ns     26577469
BM_UnextendByZeroes/1               72.1 ns         72.1 ns     19426883
BM_UnextendByZeroes/10              72.1 ns         72.1 ns     19425254
BM_UnextendByZeroes/100              104 ns          104 ns     13511070
BM_UnextendByZeroes/1000             134 ns          134 ns     10469228
BM_UnextendByZeroes/10000            136 ns          136 ns     10319449
BM_UnextendByZeroes/100000           166 ns          166 ns      8419827
BM_UnextendByZeroes/1000000          166 ns          166 ns      8423423
BM_UnextendByZeroes/1               72.2 ns         72.1 ns     19426136
BM_UnextendByZeroes/32              71.2 ns         71.2 ns     19662884
BM_UnextendByZeroes/1024            72.9 ns         72.8 ns     19220620
BM_UnextendByZeroes/32768           72.8 ns         72.7 ns     19257671
BM_UnextendByZeroes/1048576         74.9 ns         74.8 ns     18706291
BM_Concat/1                         48.3 ns         48.3 ns     29217799
BM_Concat/10                        47.9 ns         47.8 ns     29174333
BM_Concat/100                       80.6 ns         80.5 ns     17385335
BM_Concat/1000                       112 ns          112 ns     12476486
BM_Concat/10000                      112 ns          112 ns     12474688
BM_Concat/100000                     144 ns          144 ns      9730911
BM_Concat/1000000                    144 ns          144 ns      9729315
BM_Concat/1                         48.4 ns         48.2 ns     29230109
BM_Concat/32                        49.7 ns         49.6 ns     28210226
BM_Concat/1024                      49.7 ns         49.7 ns     28174252
BM_Concat/32768                     48.9 ns         48.9 ns     28657662
BM_Concat/1048576                   52.0 ns         51.9 ns     26960900
BM_Memcpy/0                         22.6 ns         22.6 ns     62252103 bytes_per_second=0/s
BM_Memcpy/1                         56.3 ns         56.2 ns     24911656 bytes_per_second=16.9681Mi/s
BM_Memcpy/100                        188 ns          187 ns      7447593 bytes_per_second=508.828Mi/s
BM_Memcpy/2048                      2909 ns         2906 ns       478688 bytes_per_second=672.119Mi/s
BM_Memcpy/10000                    14856 ns        14840 ns        91996 bytes_per_second=642.626Mi/s
BM_Memcpy/500000                  881880 ns       880855 ns         1581 bytes_per_second=541.334Mi/s
BM_RemoveSuffix/1/1                 70.6 ns         70.5 ns     19848533
BM_RemoveSuffix/100/10              70.6 ns         70.5 ns     19848830
BM_RemoveSuffix/100/100              103 ns          103 ns     13629246
BM_RemoveSuffix/10000/1             70.6 ns         70.5 ns     19849210
BM_RemoveSuffix/10000/100            103 ns          103 ns     13628649
BM_RemoveSuffix/10000/10000          136 ns          135 ns     10337918
BM_RemoveSuffix/500000/1            70.6 ns         70.5 ns     19849439
BM_RemoveSuffix/500000/100           103 ns          103 ns     13627596
BM_RemoveSuffix/500000/10000         136 ns          135 ns     10338207
BM_RemoveSuffix/500000/500000        167 ns          167 ns      8378533

This change introduces a hardware-accelerated implementation of CRC32C for RISC-V processors that support the Zbc (Carry-less multiplication) or Zbkc extensions.

Key changes:

- Implemented CRC32AcceleratedRISCV using clmul and clmulh instructions via inline assembly.

- Added runtime CPU feature detection for RISC-V using riscv_hwprobe on Linux to safely enable the accelerated path.

- Updated CRCImpl::NewInternal to instantiate the RISC-V implementation when supported hardware is detected.

- Updated CMakeLists.txt to detect compiler support for -march=rv64gc_zbc or -march=rv64gc_zbkc and apply it to the specific translation unit.

- Updated BUILD.bazel to apply -march=rv64gc_zbc for riscv64 builds using GCC/Clang, following Abseil's existing patterns for architecture-specific flags.

This implementation improves CRC32C throughput on supported RISC-V hardware by roughly 1.2x to 1.6x in the benchmarks summarized above, by using carry-less multiplication instructions instead of the table-based software fallback.

Co-authored-by: gong-flying <[email protected]>