Skip to content

Refactor GPU utilities, improve locality and kernel efficiency, and add profiling support#43

Open
anaruse wants to merge 47 commits into
r-ccs-cms:mainfrom
anaruse:feature/sbd-gpu-optimization-parity-locality
Open

Refactor GPU utilities, improve locality and kernel efficiency, and add profiling support#43
anaruse wants to merge 47 commits into
r-ccs-cms:mainfrom
anaruse:feature/sbd-gpu-optimization-parity-locality

Conversation

@anaruse

@anaruse anaruse commented May 25, 2026

Copy link
Copy Markdown

This PR introduces a set of refactors and optimizations for the SBD GPU implementation, focusing on improving data locality, kernel efficiency, and maintainability.

Key changes

  1. Data locality improvement
  • Introduce block-based index reordering based on KetIndex (SBD_REORDER_INDEX_ARRAY)
  • Improve memory access locality without breaking BraIndex ordering
  • Shift bottleneck from memory access to compute
  1. Kernel efficiency improvements
  • Reduce register pressure:
    • 32-bit parity path (SBD_USE_32BIT_PARITY)
    • use float for sign handling
  • Reduce branching and simplify parity logic
  • Avoid costly 64-bit division in bit operations
  • Enable optional thrust::par_nosync execution (SBD_USE_THRUST_NOSYNC)
  1. Communication optimization
  • Add NCCL-based collectives (SBD_USE_NCCL)
  • Fuse reductions (e.g., nccl_allreduce2) to reduce launch overhead
  • Introduce a_comm communicator for better communication grouping
  • Support configurable MPI rank distribution (block vs cyclic)
  1. cuBLAS-based optimizations
  • Use GEMV-based batched operations:
    • Batched inner products
    • Batched AXPY
  • Reduce kernel launch overhead and improve memory efficiency
  1. Utility refactor (major cleanup)
  • Introduce:
    • cuda_utility.h
    • nccl_utility.h
    • cublas_utility.h
  • Unify error handling:
    • SBD_CHECK_CUDA
    • SBD_CHECK_NCCL
    • SBD_CHECK_CUBLAS
  1. Profiling and instrumentation
  • Add NVTX annotations across key regions:
    • Davidson loop
    • thrust operations
    • MPI/NCCL calls
  • Enable detailed timeline analysis with Nsight Systems
  1. Build / configuration cleanup
  • Consolidate performance-related macros in Configuration
  • Add inline documentation for each flag
  • Set NVHPC-based configuration as default

Notes

  • Optimizations follow a stepwise approach:
    1. Improve locality → memory bottleneck reduced
    2. Compute becomes bottleneck → optimize kernels
  • Some features are controlled via compile-time flags for flexibility.

This PR improves both performance and code maintainability while preparing the codebase for further optimization and sharing.

anaruse added 30 commits March 24, 2026 21:04
…MultiDoubleAlpha, MultSingleBeta, and MultDoubleBeta.
… NCCL-based reductions

- Switch Davidson basis storage to flat device buffers (C/HC) to avoid repeated packing
- Replace batched inner products and AXPY updates with GEMV on contiguous memory
- Introduce NCCL-based reduction in Gram-Schmidt orthogonalization (remove host transfers)
- Remove unnecessary host-side operations (e.g., negate) by adjusting GEMV formulation
- Improve overlap of compute and communication by using CUDA streams
anaruse added 17 commits April 7, 2026 23:39
- Extend Normalize/Normalize2 to avoid unnecessary MPI_Allreduce calls when comm size is 1
- Add device-pointer version of cuBLAS dot to enable GPU-side accumulation
- Reduce host-device synchronization in Normalize2 using optional workspace
- Remove redundant pre-normalization scaling and apply correction to norms instead
- Minor cleanup and workspace usage improvements
- add optional maxregcount build flag
- add debug prints for single-excitation helper buffers
- add selectable original/transposed/blocked-k index mappings
- simplify same-braIdx accumulation in vectorized alpha-beta kernel
- remove experimental hash-based aggregation path
- switch alpha-beta rank distribution to contiguous chunks
Introduce an optional index reordering mechanism based on KetIndex to improve
data locality while preserving BraIndex ordering.

- Add SBD_REORDER_INDEX_ARRAY flag to enable block-based permutation of
  excitation entries (single/double, alpha/beta).
- Implement stable permutation using histogram + prefix-sum + scatter.
- Apply permutation consistently to KetIndex, BraIndex, and Cr/An arrays.
- Avoid full sorting to prevent randomization of BraIndex, which may
  negatively impact performance.

Refactor MPI work distribution:

- Replace SBD_USE_STRIDED_RANK_DISTRIBUTION with SBD_USE_RANK_DISTRIBUTION.
- Add SBD_USE_BLOCK_RANK_DISTRIBUTION to support contiguous block assignment.
- Support both block (contiguous) and cyclic (strided) distribution via
  transform iterators.
- Add runtime logging for selected distribution mode.

Other changes:

- Initialize index pointers to nullptr for safety.
- Clean up offset handling and conditional increments.
- Simplify kernel mapping logic by removing unused variants.
- Add <cassert> and improve internal validation via assertions.
- Remove ket_index_maxval parameter from setup_permutation
- Compute upper bound (ket_index_limit) using std::max_element
- Improve robustness by deriving range directly from actual data
- Turn on SBD_USE_VECTORIZATION
- Remove unused vectorized code path
- Aggregate same braIdx entries before atomicAdd
- Clean up kernel and distribution logic
…oice

- introduce SBD_USE_32BIT_PARITY and use __popc-based parity to reduce
  register pressure
- preserve original parity behavior in the new 32-bit path
- add runtime checks/logging for 32-bit parity mode
- document cache-oriented block_size choice for index reordering
- disable vectorization by default
- switch bit_length from 64-bit to 32-bit to avoid slow 64-bit division
- simplify parity logic (remove branch, use bitwise parity check)
- align 32-bit parity path with updated implementation
- tune reorder block_size to 32 for better cache locality
- make parity() generic over sign type (SgnT)
- use SgnT in parity computation to avoid conversions
- change sgn from double to float in excitation kernels
- reduces register usage in GPU kernels
- unify block handling with masks and remove special-case branches
- fold start-bit contribution into nonZeroBits

Clarify KetIndex-based permutation intent and trade-offs

Minor cleanup (redundant check removal, comment updates)
- split CUDA and NCCL helpers into dedicated utility headers
- standardize error handling via SBD_CHECK_{CUDA,NCCL,CUBLAS}
- refactor cuBLAS helpers and drop unused complex path
- migrate all usages to new macros
- reorganize configuration flags and document performance-related options
- shift start to avoid explicit start-bit contribution
- unify parity logic with popcount-based range counting
- reduce branching and instruction overhead
- improve readability and document behavior
- Drop unused #if branches and fallback implementations
- Guard debug prints with SBD_DEBUG
- Add NVTX range and clarify Normalize2 behavior
Restore the default Configuration to the main-branch settings and move
NVHPC-specific compiler flags and libraries into Configuration.nvhpc.

Add Makefile.nvhpc for NVHPC builds so the default build configuration remains
portable.
thrust::device_vector<double> A(W.size(), 0.0);
nccl_allreduce(A, ncclSum, a_nccl_comm);
}
printf("[%s,%d] NCCL communicators have been created.n",

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
printf("[%s,%d] NCCL communicators have been created.n",
printf("[%s,%d] NCCL communicators have been created.\n",

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants