Refactor GPU utilities, improve locality and kernel efficiency, and add profiling support #43
Open
anaruse wants to merge 47 commits into
…MultiDoubleAlpha, MultSingleBeta, and MultDoubleBeta.
… NCCL-based reductions
- Switch Davidson basis storage to flat device buffers (C/HC) to avoid repeated packing
- Replace batched inner products and AXPY updates with GEMV on contiguous memory
- Introduce NCCL-based reduction in Gram-Schmidt orthogonalization (remove host transfers)
- Remove unnecessary host-side operations (e.g., negate) by adjusting GEMV formulation
- Improve overlap of compute and communication by using CUDA streams
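To illustrate the GEMV reformulation on a flat basis buffer, here is a minimal host-side sketch. It is not the code in this PR: the function names `project` and `update`, the row-per-vector layout, and the plain loops standing in for cuBLAS GEMV calls are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assumed layout: C holds k basis vectors of length n back to back,
// so vector i occupies C[i*n] .. C[i*n + n - 1].

// h = C^T w: all k inner products <c_i, w> computed in one pass,
// which is what a single GEMV over the flat buffer provides.
// (Hypothetical helper, plain loops instead of cuBLAS.)
std::vector<double> project(const std::vector<double>& C, std::size_t n,
                            std::size_t k, const std::vector<double>& w) {
    assert(C.size() == n * k && w.size() == n);
    std::vector<double> h(k, 0.0);
    for (std::size_t i = 0; i < k; ++i)
        for (std::size_t j = 0; j < n; ++j)
            h[i] += C[i * n + j] * w[j];
    return h;
}

// w <- w + alpha * C h: all k AXPY updates expressed as one GEMV.
// Passing alpha = -1 folds the subtraction into the call, so the
// coefficients h never need a separate negation step.
void update(const std::vector<double>& C, std::size_t n, std::size_t k,
            const std::vector<double>& h, double alpha,
            std::vector<double>& w) {
    assert(C.size() == n * k && h.size() == k && w.size() == n);
    for (std::size_t i = 0; i < k; ++i)
        for (std::size_t j = 0; j < n; ++j)
            w[j] += alpha * h[i] * C[i * n + j];
}
```

Calling `project` and then `update` with `alpha = -1` removes the components of `w` along every stored basis vector, which is one Gram-Schmidt step without any per-vector packing.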
- Extend Normalize/Normalize2 to avoid unnecessary MPI_Allreduce calls when comm size is 1
- Add device-pointer version of cuBLAS dot to enable GPU-side accumulation
- Reduce host-device synchronization in Normalize2 using optional workspace
- Remove redundant pre-normalization scaling and apply correction to norms instead
- Minor cleanup and workspace usage improvements

- add optional maxregcount build flag
- add debug prints for single-excitation helper buffers
- add selectable original/transposed/blocked-k index mappings
- simplify same-braIdx accumulation in vectorized alpha-beta kernel
- remove experimental hash-based aggregation path
- switch alpha-beta rank distribution to contiguous chunks
Introduce an optional index reordering mechanism based on KetIndex to improve data locality while preserving BraIndex ordering.
- Add SBD_REORDER_INDEX_ARRAY flag to enable block-based permutation of excitation entries (single/double, alpha/beta).
- Implement stable permutation using histogram + prefix-sum + scatter.
- Apply permutation consistently to KetIndex, BraIndex, and Cr/An arrays.
- Avoid full sorting to prevent randomization of BraIndex, which may negatively impact performance.

Refactor MPI work distribution:
- Replace SBD_USE_STRIDED_RANK_DISTRIBUTION with SBD_USE_RANK_DISTRIBUTION.
- Add SBD_USE_BLOCK_RANK_DISTRIBUTION to support contiguous block assignment.
- Support both block (contiguous) and cyclic (strided) distribution via transform iterators.
- Add runtime logging for selected distribution mode.

Other changes:
- Initialize index pointers to nullptr for safety.
- Clean up offset handling and conditional increments.
- Simplify kernel mapping logic by removing unused variants.
- Add <cassert> and improve internal validation via assertions.
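The difference between the block (contiguous) and cyclic (strided) distributions named in this commit can be sketched on the host as follows. The function names `block_items` and `cyclic_items` are hypothetical, and the way the remainder is spread over the first ranks is an assumption; the PR itself expresses the mapping through transform iterators.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Block distribution: each rank owns one contiguous chunk of work items.
// Assumption: the first (total % size) ranks receive one extra item.
std::vector<std::size_t> block_items(std::size_t total, std::size_t size,
                                     std::size_t rank) {
    assert(size > 0 && rank < size);
    const std::size_t base = total / size;
    const std::size_t rem = total % size;
    const std::size_t begin = rank * base + (rank < rem ? rank : rem);
    const std::size_t count = base + (rank < rem ? 1 : 0);
    std::vector<std::size_t> items;
    for (std::size_t i = 0; i < count; ++i) items.push_back(begin + i);
    return items;
}

// Cyclic distribution: rank r owns items r, r + size, r + 2*size, ...
std::vector<std::size_t> cyclic_items(std::size_t total, std::size_t size,
                                      std::size_t rank) {
    assert(size > 0 && rank < size);
    std::vector<std::size_t> items;
    for (std::size_t i = rank; i < total; i += size) items.push_back(i);
    return items;
}
```

With 10 items on 3 ranks, rank 1 would own items 4, 5, 6 under the block mapping and items 1, 4, 7 under the cyclic mapping. The block form keeps each rank's accesses adjacent in memory, while the cyclic form tends to even out load when cost varies systematically along the index range.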
- Remove ket_index_maxval parameter from setup_permutation
- Compute upper bound (ket_index_limit) using std::max_element
- Improve robustness by deriving range directly from actual data
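A minimal host-side sketch of a stable block-based permutation built from a histogram, a prefix sum, and a scatter, with the index range derived from the data via `std::max_element`. The names `setup_permutation_sketch` and `apply_permutation`, the `std::size_t` index type, and the container choices are assumptions for illustration and not the signatures used in this PR.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Returns perm such that the entry originally at perm[p] moves to slot p.
// Entries are grouped by block (ket_index / block_size); inside a block
// the original order is kept, so this is a stable partial reordering
// rather than a full sort.
std::vector<std::size_t> setup_permutation_sketch(
        const std::vector<std::size_t>& ket_index, std::size_t block_size) {
    assert(block_size > 0);
    std::vector<std::size_t> perm(ket_index.size());
    if (ket_index.empty()) return perm;

    // Derive the upper bound from the data instead of taking a parameter.
    const std::size_t ket_index_limit =
        *std::max_element(ket_index.begin(), ket_index.end()) + 1;
    const std::size_t num_blocks =
        (ket_index_limit + block_size - 1) / block_size;

    // 1. Histogram: offset[b + 1] counts the entries falling in block b.
    std::vector<std::size_t> offset(num_blocks + 1, 0);
    for (std::size_t k : ket_index) ++offset[k / block_size + 1];

    // 2. Prefix sum: offset[b] becomes the first output slot of block b.
    for (std::size_t b = 0; b < num_blocks; ++b) offset[b + 1] += offset[b];

    // 3. Scatter in input order, which is what makes the result stable.
    for (std::size_t i = 0; i < ket_index.size(); ++i)
        perm[offset[ket_index[i] / block_size]++] = i;
    return perm;
}

// Applies the same permutation to any companion array
// (KetIndex, BraIndex, Cr/An, ...).
template <typename T>
std::vector<T> apply_permutation(const std::vector<T>& in,
                                 const std::vector<std::size_t>& perm) {
    assert(in.size() == perm.size());
    std::vector<T> out(in.size());
    for (std::size_t p = 0; p < perm.size(); ++p) out[p] = in[perm[p]];
    return out;
}
```

Because entries are only grouped by block and never sorted inside a block, companion arrays keep their relative order within each group, which is the property the commit messages describe as preserving BraIndex ordering.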
- Turn on SBD_USE_VECTORIZATION
- Remove unused vectorized code path
- Aggregate same braIdx entries before atomicAdd
- Clean up kernel and distribution logic
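The idea of aggregating entries that share a braIdx before issuing an atomicAdd can be sketched on the host as run-length accumulation. This is an illustration under assumptions, not the kernel from this PR: `accumulate_runs` is a hypothetical name, and the plain `+=` marks the place where a device kernel would issue its single atomicAdd per run.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Adds contrib[i] into out[bra_idx[i]], but sums each run of consecutive
// equal bra_idx values locally first. Returns how many updates to `out`
// were issued (one per run instead of one per entry).
std::size_t accumulate_runs(const std::vector<std::size_t>& bra_idx,
                            const std::vector<double>& contrib,
                            std::vector<double>& out) {
    assert(bra_idx.size() == contrib.size());
    std::size_t updates = 0;
    std::size_t i = 0;
    while (i < bra_idx.size()) {
        const std::size_t row = bra_idx[i];
        assert(row < out.size());
        double partial = 0.0;  // local (register) accumulator
        while (i < bra_idx.size() && bra_idx[i] == row) {
            partial += contrib[i];
            ++i;
        }
        out[row] += partial;   // on the GPU: one atomicAdd per run
        ++updates;
    }
    return updates;
}
```

The saving grows with the length of the runs, which is one reason an ordering that keeps equal braIdx values adjacent matters for this approach.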
…oice
- introduce SBD_USE_32BIT_PARITY and use __popc-based parity to reduce register pressure
- preserve original parity behavior in the new 32-bit path
- add runtime checks/logging for 32-bit parity mode
- document cache-oriented block_size choice for index reordering
- disable vectorization by default

- switch bit_length from 64-bit to 32-bit to avoid slow 64-bit division
- simplify parity logic (remove branch, use bitwise parity check)
- align 32-bit parity path with updated implementation
- tune reorder block_size to 32 for better cache locality

- make parity() generic over sign type (SgnT)
- use SgnT in parity computation to avoid conversions
- change sgn from double to float in excitation kernels
- reduces register usage in GPU kernels
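A minimal host-side sketch of a branch-free, popcount-based parity that is generic over the sign type. The name `parity_sketch`, its arguments, and the single-word restriction are assumptions for illustration; `std::bitset::count` stands in for the `__popc` intrinsic a device kernel would use.

```cpp
#include <bitset>
#include <cassert>
#include <cstdint>

// Sign (+1 or -1) given by the number of set bits of `word` strictly
// between bit positions lo and hi. SgnT is the caller's sign type,
// e.g. float in a kernel that wants to avoid double conversions.
template <typename SgnT>
SgnT parity_sketch(std::uint32_t word, int lo, int hi) {
    assert(0 <= lo && lo < hi && hi < 32);
    const std::uint32_t below_hi = (std::uint32_t{1} << hi) - 1u;        // bits 0..hi-1
    const std::uint32_t up_to_lo = (std::uint32_t{1} << (lo + 1)) - 1u;  // bits 0..lo
    const std::uint32_t between = below_hi & ~up_to_lo;                  // bits lo+1..hi-1
    const int n = static_cast<int>(std::bitset<32>(word & between).count());
    // Bitwise parity check, no branch: even count gives +1, odd gives -1.
    return static_cast<SgnT>(1 - 2 * (n & 1));
}
```

For example, `parity_sketch<float>(0b101101u, 0, 3)` sees one set bit (bit 2) between positions 0 and 3 and evaluates to `-1.0f`.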
- unify block handling with masks and remove special-case branches
- fold start-bit contribution into nonZeroBits

Clarify KetIndex-based permutation intent and trade-offs

Minor cleanup (redundant check removal, comment updates)
- split CUDA and NCCL helpers into dedicated utility headers
- standardize error handling via SBD_CHECK_{CUDA,NCCL,CUBLAS}
- refactor cuBLAS helpers and drop unused complex path
- migrate all usages to new macros
- reorganize configuration flags and document performance-related options
- shift start to avoid explicit start-bit contribution
- unify parity logic with popcount-based range counting
- reduce branching and instruction overhead
- improve readability and document behavior
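One way to read "popcount-based range counting" with a shifted start is sketched below for a bit string stored as 32-bit words. This is an interpretation, not the code in this PR: the names `count_bits_in_range` and `range_parity`, the word layout, and the use of 64-bit intermediates to build masks on the host are all assumptions.

```cpp
#include <algorithm>
#include <bitset>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Number of set bits at positions [begin, end) of a bit string stored as
// 32-bit words (bit p lives in word p / 32 at offset p % 32). Every word
// is handled the same way through a mask, with no first/last special case.
int count_bits_in_range(const std::vector<std::uint32_t>& words,
                        std::size_t begin, std::size_t end) {
    assert(begin <= end && end <= 32 * words.size());
    int n = 0;
    for (std::size_t w = 0; w < words.size(); ++w) {
        const std::size_t base = 32 * w;
        // Clamp the range to this word: local offsets lo <= hi in 0..32.
        const std::size_t lo =
            begin > base ? std::min(begin - base, std::size_t{32}) : 0;
        const std::size_t hi =
            end > base ? std::min(end - base, std::size_t{32}) : 0;
        // 64-bit intermediates keep a shift by 32 well defined on the host.
        const std::uint64_t mask = ((std::uint64_t{1} << hi) - 1) &
                                   ~((std::uint64_t{1} << lo) - 1);
        n += static_cast<int>(
            std::bitset<32>(words[w] & static_cast<std::uint32_t>(mask)).count());
    }
    return n;
}

// Sign from the bits strictly between start and end. Counting over
// [start + 1, end) excludes the start bit directly, so no separate
// start-bit correction is needed afterwards.
template <typename SgnT>
SgnT range_parity(const std::vector<std::uint32_t>& words,
                  std::size_t start, std::size_t end) {
    assert(start < end);
    const int n = count_bits_in_range(words, start + 1, end);
    return static_cast<SgnT>(1 - 2 * (n & 1));
}
```

Words that lie entirely outside the range end up with an empty mask and contribute zero, which is what removes the need for branches on the first and last word.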
- Drop unused #if branches and fallback implementations
- Guard debug prints with SBD_DEBUG
- Add NVTX range and clarify Normalize2 behavior
Restore the default Configuration to the main-branch settings and move NVHPC-specific compiler flags and libraries into Configuration.nvhpc. Add Makefile.nvhpc for NVHPC builds so the default build configuration remains portable.
ikkoham
reviewed
May 25, 2026
thrust::device_vector<double> A(W.size(), 0.0);
nccl_allreduce(A, ncclSum, a_nccl_comm);
}
printf("[%s,%d] NCCL communicators have been created.n",
Suggested change
- printf("[%s,%d] NCCL communicators have been created.n",
+ printf("[%s,%d] NCCL communicators have been created.\n",
This PR introduces a set of refactors and optimizations for the SBD GPU implementation, focusing on improving data locality, kernel efficiency, and maintainability.
Key changes
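- Optional KetIndex-based index reordering (`SBD_REORDER_INDEX_ARRAY`)
- 32-bit `__popc`-based parity (`SBD_USE_32BIT_PARITY`)
- `float` for sign handling
- `thrust::par_nosync` execution (`SBD_USE_THRUST_NOSYNC`)
- NCCL-based reductions (`SBD_USE_NCCL`)
- … (`nccl_allreduce2`) to reduce launch overhead
- `a_comm` communicator for better communication grouping
- Dedicated utility headers: `cuda_utility.h`, `nccl_utility.h`, `cublas_utility.h`
- Standardized error-checking macros: `SBD_CHECK_CUDA`, `SBD_CHECK_NCCL`, `SBD_CHECK_CUBLAS`

Notes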
This PR improves both performance and code maintainability while preparing the codebase for further optimization and sharing.