Refactor GPU utilities, improve locality and kernel efficiency, and add profiling support by anaruse · Pull Request #43 · r-ccs-cms/sbd

anaruse · 2026-05-25T11:42:13Z

This PR introduces a set of refactors and optimizations for the SBD GPU implementation, focusing on improving data locality, kernel efficiency, and maintainability.

Key changes

Data locality improvement

Introduce block-based index reordering based on KetIndex (SBD_REORDER_INDEX_ARRAY)
Improve memory access locality without breaking BraIndex ordering
Shift bottleneck from memory access to compute

Kernel efficiency improvements

Reduce register pressure:
- 32-bit parity path (SBD_USE_32BIT_PARITY)
- use float for sign handling
Reduce branching and simplify parity logic
Avoid costly 64-bit division in bit operations
Enable optional thrust::par_nosync execution (SBD_USE_THRUST_NOSYNC)

Communication optimization

Add NCCL-based collectives (SBD_USE_NCCL)
Fuse reductions (e.g., nccl_allreduce2) to reduce launch overhead
Introduce a_comm communicator for better communication grouping
Support configurable MPI rank distribution (block vs cyclic)

cuBLAS-based optimizations

Use GEMV-based batched operations:
- Batched inner products
- Batched AXPY
Reduce kernel launch overhead and improve memory efficiency

Utility refactor (major cleanup)

Introduce:
- cuda_utility.h
- nccl_utility.h
- cublas_utility.h
Unify error handling:
- SBD_CHECK_CUDA
- SBD_CHECK_NCCL
- SBD_CHECK_CUBLAS

Profiling and instrumentation

Add NVTX annotations across key regions:
- Davidson loop
- thrust operations
- MPI/NCCL calls
Enable detailed timeline analysis with Nsight Systems

Build / configuration cleanup

Consolidate performance-related macros in Configuration
Add inline documentation for each flag
Set NVHPC-based configuration as default

Notes

Optimizations follow a stepwise approach:
1. Improve locality → memory bottleneck reduced
2. Compute becomes bottleneck → optimize kernels
Some features are controlled via compile-time flags for flexibility.

This PR improves both performance and code maintainability while preparing the codebase for further optimization and sharing.

…MultiDoubleAlpha, MultSingleBeta, and MultDoubleBeta.

… NCCL-based reductions - Switch Davidson basis storage to flat device buffers (C/HC) to avoid repeated packing - Replace batched inner products and AXPY updates with GEMV on contiguous memory - Introduce NCCL-based reduction in Gram-Schmidt orthogonalization (remove host transfers) - Remove unnecessary host-side operations (e.g., negate) by adjusting GEMV formulation - Improve overlap of compute and communication by using CUDA streams

- Extend Normalize/Normalize2 to avoid unnecessary MPI_Allreduce calls when comm size is 1 - Add device-pointer version of cuBLAS dot to enable GPU-side accumulation - Reduce host-device synchronization in Normalize2 using optional workspace - Remove redundant pre-normalization scaling and apply correction to norms instead - Minor cleanup and workspace usage improvements

- add optional maxregcount build flag - add debug prints for single-excitation helper buffers - add selectable original/transposed/blocked-k index mappings - simplify same-braIdx accumulation in vectorized alpha-beta kernel - remove experimental hash-based aggregation path - switch alpha-beta rank distribution to contiguous chunks

Introduce an optional index reordering mechanism based on KetIndex to improve data locality while preserving BraIndex ordering. - Add SBD_REORDER_INDEX_ARRAY flag to enable block-based permutation of excitation entries (single/double, alpha/beta). - Implement stable permutation using histogram + prefix-sum + scatter. - Apply permutation consistently to KetIndex, BraIndex, and Cr/An arrays. - Avoid full sorting to prevent randomization of BraIndex, which may negatively impact performance. Refactor MPI work distribution: - Replace SBD_USE_STRIDED_RANK_DISTRIBUTION with SBD_USE_RANK_DISTRIBUTION. - Add SBD_USE_BLOCK_RANK_DISTRIBUTION to support contiguous block assignment. - Support both block (contiguous) and cyclic (strided) distribution via transform iterators. - Add runtime logging for selected distribution mode. Other changes: - Initialize index pointers to nullptr for safety. - Clean up offset handling and conditional increments. - Simplify kernel mapping logic by removing unused variants. - Add <cassert> and improve internal validation via assertions.

- Remove ket_index_maxval parameter from setup_permutation - Compute upper bound (ket_index_limit) using std::max_element - Improve robustness by deriving range directly from actual data

- Turn on SBD_USE_VECTORIZATION - Remove unused vectorized code path - Aggregate same braIdx entries before atomicAdd - Clean up kernel and distribution logic

…oice - introduce SBD_USE_32BIT_PARITY and use __popc-based parity to reduce register pressure - preserve original parity behavior in the new 32-bit path - add runtime checks/logging for 32-bit parity mode - document cache-oriented block_size choice for index reordering - disable vectorization by default

- switch bit_length from 64-bit to 32-bit to avoid slow 64-bit division - simplify parity logic (remove branch, use bitwise parity check) - align 32-bit parity path with updated implementation - tune reorder block_size to 32 for better cache locality

- make parity() generic over sign type (SgnT) - use SgnT in parity computation to avoid conversions - change sgn from double to float in excitation kernels - reduces register usage in GPU kernels

- unify block handling with masks and remove special-case branches - fold start-bit contribution into nonZeroBits Clarify KetIndex-based permutation intent and trade-offs Minor cleanup (redundant check removal, comment updates)

- split CUDA and NCCL helpers into dedicated utility headers - standardize error handling via SBD_CHECK_{CUDA,NCCL,CUBLAS} - refactor cuBLAS helpers and drop unused complex path - migrate all usages to new macros - reorganize configuration flags and document performance-related options

- shift start to avoid explicit start-bit contribution - unify parity logic with popcount-based range counting - reduce branching and instruction overhead - improve readability and document behavior

- Drop unused #if branches and fallback implementations - Guard debug prints with SBD_DEBUG - Add NVTX range and clarify Normalize2 behavior

Restore the default Configuration to the main-branch settings and move NVHPC-specific compiler flags and libraries into Configuration.nvhpc. Add Makefile.nvhpc for NVHPC builds so the default build configuration remains portable.

ikkoham · 2026-05-25T12:08:58Z

+          thrust::device_vector<double> A(W.size(), 0.0);
+          nccl_allreduce(A, ncclSum, a_nccl_comm);
+      }
+      printf("[%s,%d] NCCL communicators have been created.n",


Suggested change

printf("[%s,%d] NCCL communicators have been created.n",

printf("[%s,%d] NCCL communicators have been created.\n",

anaruse added 30 commits March 24, 2026 21:04

Changed make options for NVHPC

1746d5f

Added code to use NVTX3

02f03ae

Added NVTX annotations

894d82c

Improve GPU kernel for MultAlphaBeta

6d1f0d3

Added NVTXs

33f4a1b

Improve GPU kernels in mult.run()

9e441b3

Vectorize MultAlphaBeta

72d461e

[MultAlphaBeta] Reduce atomicAdd to device memory using hash-table

b2c6f62

Aded .gitignore

0ca593d

Merge branch 'main' into work.20260319

e71cff0

Code cleanup

6f30a12

Implement MultUnified that integrates four kernels: MultSingleAlpha, …

dc7b654

…MultiDoubleAlpha, MultSingleBeta, and MultDoubleBeta.

Refactoring code for MultAlphaBeta

2ca63e5

Use MPI_IN_PLACE for MPI_Allreduce()

96fac5e

Add NVTXs and change Configuration

742d520

Change order of MPI_Allreduce()

8863e6b

Add NVTX

0023d36

Add NVTXs

183d19f

Added support for NCCL

e586451

Use cuBLAS for batched inner product

fb05ef8

Small changes

f70d37b

Add support for thrust nosync

756c9ce

Use cuBLAS for Normalization

8b94499

Implement BatchedAXPY_GEMV

6d3ae99

Implement GramSchmidtOrthogonalize_GEMV()

d9c1dfe

Implement Normalize2()

d6f4dd6

Add communicator for allreduce

33779e4

Add nccl_allreduce2 to fuse two allrecude and improve comment

8cb110b

Changed BatchedAXPY_GEMV

2329e89

anaruse added 17 commits April 7, 2026 23:39

Merge branch 'main' into work.20260319

039932f

Infer KetIndex range from data instead of passing max value

1a17f4d

- Remove ket_index_maxval parameter from setup_permutation - Compute upper bound (ket_index_limit) using std::max_element - Improve robustness by deriving range directly from actual data

Enable vectorization and simplify VecLen kernel logic

433eb90

- Turn on SBD_USE_VECTORIZATION - Remove unused vectorized code path - Aggregate same braIdx entries before atomicAdd - Clean up kernel and distribution logic

Template parity and switch sgn to float for lower register pressure

a7787a5

- make parity() generic over sign type (SgnT) - use SgnT in parity computation to avoid conversions - change sgn from double to float in excitation kernels - reduces register usage in GPU kernels

Refactor parity bit counting to reduce branching

5976175

- unify block handling with masks and remove special-case branches - fold start-bit contribution into nonZeroBits Clarify KetIndex-based permutation intent and trade-offs Minor cleanup (redundant check removal, comment updates)

Merge branch 'main' into work.20260421

ae1a9b9

Optimize parity computation by removing start-bit special handling

3bfcb26

- shift start to avoid explicit start-bit contribution - unify parity logic with popcount-based range counting - reduce branching and instruction overhead - improve readability and document behavior

Remove dead code and clean up debug/optimization paths

baec8de

- Drop unused #if branches and fallback implementations - Guard debug prints with SBD_DEBUG - Add NVTX range and clarify Normalize2 behavior

Separate NVHPC build configuration

fe7eb15

Restore the default Configuration to the main-branch settings and move NVHPC-specific compiler flags and libraries into Configuration.nvhpc. Add Makefile.nvhpc for NVHPC builds so the default build configuration remains portable.

Merge branch 'main' into work.20260525x

a90e313

Merge branch 'main.20260525' into work.20260525

b854d82

ikkoham reviewed May 25, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Refactor GPU utilities, improve locality and kernel efficiency, and add profiling support#43

Refactor GPU utilities, improve locality and kernel efficiency, and add profiling support#43
anaruse wants to merge 47 commits into
r-ccs-cms:mainfrom
anaruse:feature/sbd-gpu-optimization-parity-locality

anaruse commented May 25, 2026

Uh oh!

ikkoham May 25, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

	printf("[%s,%d] NCCL communicators have been created.n",
	printf("[%s,%d] NCCL communicators have been created.\n",

Conversation

anaruse commented May 25, 2026

Uh oh!

ikkoham May 25, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants