Merged
Changes from 47 commits
48 commits
1746d5f
Changed make options for NVHPC
anaruse Mar 24, 2026
02f03ae
Added code to use NVTX3
anaruse Mar 24, 2026
894d82c
Added NVTX annotations
anaruse Mar 24, 2026
6d1f0d3
Improve GPU kernel for MultAlphaBeta
anaruse Mar 24, 2026
33f4a1b
Added NVTXs
anaruse Mar 25, 2026
9e441b3
Improve GPU kernels in mult.run()
anaruse Mar 25, 2026
72d461e
Vectorize MultAlphaBeta
anaruse Mar 26, 2026
b2c6f62
[MultAlphaBeta] Reduce atomicAdd to device memory using hash-table
anaruse Mar 26, 2026
0ca593d
Aded .gitignore
anaruse Mar 26, 2026
e71cff0
Merge branch 'main' into work.20260319
anaruse Mar 26, 2026
6f30a12
Code cleanup
anaruse Mar 26, 2026
dc7b654
Implement MultUnified that integrates four kernels: MultSingleAlpha, …
anaruse Mar 27, 2026
2ca63e5
Refactoring code for MultAlphaBeta
anaruse Mar 27, 2026
96fac5e
Use MPI_IN_PLACE for MPI_Allreduce()
anaruse Mar 30, 2026
742d520
Add NVTXs and change Configuration
anaruse Mar 30, 2026
8863e6b
Change order of MPI_Allreduce()
anaruse Mar 30, 2026
0023d36
Add NVTX
anaruse Mar 30, 2026
183d19f
Add NVTXs
anaruse Mar 30, 2026
e586451
Added support for NCCL
anaruse Mar 31, 2026
fb05ef8
Use cuBLAS for batched inner product
anaruse Apr 1, 2026
f70d37b
Small changes
anaruse Apr 1, 2026
756c9ce
Add support for thrust nosync
anaruse Apr 1, 2026
8b94499
Use cuBLAS for Normalization
anaruse Apr 2, 2026
6d3ae99
Implement BatchedAXPY_GEMV
anaruse Apr 2, 2026
d9c1dfe
Implement GramSchmidtOrthogonalize_GEMV()
anaruse Apr 2, 2026
d6f4dd6
Implement Normalize2()
anaruse Apr 2, 2026
33779e4
Add communicator for allreduce
anaruse Apr 2, 2026
8cb110b
Add nccl_allreduce2 to fuse two allrecude and improve comment
anaruse Apr 3, 2026
2329e89
Changed BatchedAXPY_GEMV
anaruse Apr 3, 2026
50257c2
Optimize SBD GPU kernels by eliminating packing overhead and enabling…
anaruse Apr 7, 2026
92464dc
Optimize normalization and dot operations in SBD Davidson solver
anaruse Apr 7, 2026
039932f
Merge branch 'main' into work.20260319
anaruse Apr 9, 2026
d321061
TPB: refactor vectorized thrust path for performance experiments
anaruse Apr 15, 2026
8331a5f
Add block-based index reordering and configurable MPI rank distribution
anaruse Apr 16, 2026
1a17f4d
Infer KetIndex range from data instead of passing max value
anaruse Apr 16, 2026
433eb90
Enable vectorization and simplify VecLen kernel logic
anaruse Apr 16, 2026
09319ea
Add 32-bit parity implementation and clarify reordering block-size ch…
anaruse Apr 17, 2026
7cf3bcf
Optimize parity and reduce cost of bit_length operations
anaruse Apr 17, 2026
a7787a5
Template parity and switch sgn to float for lower register pressure
anaruse Apr 17, 2026
5976175
Refactor parity bit counting to reduce branching
anaruse Apr 20, 2026
ae1a9b9
Merge branch 'main' into work.20260421
anaruse Apr 21, 2026
e45a34d
Modularize GPU utilities and standardize error handling
anaruse Apr 21, 2026
3bfcb26
Optimize parity computation by removing start-bit special handling
anaruse Apr 21, 2026
baec8de
Remove dead code and clean up debug/optimization paths
anaruse Apr 22, 2026
fe7eb15
Separate NVHPC build configuration
anaruse Apr 27, 2026
a90e313
Merge branch 'main' into work.20260525x
anaruse May 25, 2026
b854d82
Merge branch 'main.20260525' into work.20260525
anaruse May 25, 2026
2005895
Update include/sbd/chemistry/tpb/sbdiag.h
t-sirakawa Jun 16, 2026
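
Several commits above ("Add 32-bit parity implementation", "Refactor parity bit counting to reduce branching", "switch sgn to float") concern the parity computation, whose source is not part of the diff shown here. As a rough illustration only, and not the code from this PR, a popcount-based parity on 32-bit words might look like the sketch below. The names `parity32`, `parity64`, and `sign_between` are hypothetical, and `__builtin_popcount` is assumed to be available (it is a GCC/Clang builtin; CUDA device code would typically use `__popc` instead).

```cpp
#include <cstdint>

// Hypothetical helper: parity (0 or 1) of the set bits in a 32-bit word.
// Assumes a GCC/Clang-compatible compiler for __builtin_popcount.
inline unsigned parity32(std::uint32_t x) {
    return static_cast<unsigned>(__builtin_popcount(x)) & 1u;
}

// Parity of a 64-bit word with a single 32-bit popcount.
// XOR-folding the two halves preserves the parity of the bit count.
inline unsigned parity64(std::uint64_t x) {
    const std::uint32_t folded =
        static_cast<std::uint32_t>(x) ^ static_cast<std::uint32_t>(x >> 32);
    return parity32(folded);
}

// Sign (+1 or -1) from the number of set bits strictly between
// bit positions lo and hi of a determinant bit string.
// Precondition (not checked): lo < hi < 64.
inline float sign_between(std::uint64_t det, unsigned lo, unsigned hi) {
    const std::uint64_t below_hi = (std::uint64_t{1} << hi) - 1;        // bits 0 .. hi-1
    const std::uint64_t up_to_lo = (std::uint64_t{1} << (lo + 1)) - 1;  // bits 0 .. lo
    const std::uint64_t mask = below_hi & ~up_to_lo;                    // bits lo+1 .. hi-1
    return parity64(det & mask) ? -1.0f : 1.0f;
}
```

Folding to 32 bits before counting keeps the intermediate values in narrower registers, which is one plausible reading of the "lower register pressure" remark in the commit titles; the actual kernels may be organized differently.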
2 changes: 2 additions & 0 deletions apps/chemistry_tpb_selected_basis_diagonalization/.gitignore
@@ -0,0 +1,2 @@
diag.*
logs.*
@@ -0,0 +1,60 @@
# Path to the SBD library
SBD_PATH=../..

# mpi c++ compiler
# CCCOM=mpicxx
# CCCOM=/opt/nvidia/hpc_sdk/Linux_x86_64/2025/comm_libs/mpi/bin/mpic++

# flags for build: include path to openmp. The following is the case using homebrew's llvm on Mac
# CCFLAGS= -std=c++17 -stdlib=libc++ -fopenmp -I/opt/homebrew/opt/llvm/include -O3


# accelerate using Thrust
# CCFLAGS= -mp -cuda -fast -Minfo=accel --diag_suppress declared_but_not_referenced,set_but_not_used -fmax-errors=0 -I/opt/nvidia/hpc_sdk/Linux_x86_64/2025/cuda/include/cccl -I/usr/local/cuda/include -DSBD_THRUST
#-DSBD_PREFECT
#-DSBD_DEBUG_MULT
#-DSBD_THRUST_NO_COLLAPSE

# Specify -gpu=mem:unified option on Grace Hopper environment (?)
# std::vector between diag and mult will be lost without this option (???)
# CCFLAGS= -cuda -Minfo=accel -gpu=mem:unified --diag_suppress declared_but_not_referenced,set_but_not_used -fmax-errors=0 -I/opt/nvidia/hpc_sdk/Linux_x86_64/2025/cuda/include/cccl -I/usr/local/cuda/include -DSBD_THRUST

# CPU run
#CCFLAGS= -mp --diag_suppress declared_but_not_referenced,set_but_not_used -fmax-errors=0 -I/opt/nvidia/hpc_sdk/Linux_x86_64/2025/cuda/include/cccl -I/usr/local/cuda/include -DSBD_DEBUG_MULT


# flags for linking: include link to lapack and blas. The following is the case using homebrew's openblas on Mac
# SYSLIB= -L/opt/homebrew/opt/openblas/lib -llapack -lblas
# SYSLIB= -llapack -lblas

### Example for the Fugaku
# SBD_PATH=../../
# CCCOM=mpiFCCpx
# CCFLAGS= -Nclang -std=c++17 -stdlib=libc++ -Kfast,openmp -Xpreprocessor -fopenmp
# SYSLIB= -SSL2
#
### Trad-mode for Fugaku
# SBD_PATH=../../
# CCCOM=mpiFCCpx
# CCFLAGS= -std=c++17 -Kfast,openmp -DSBD_TRADMODE
# SYSLIB= -SSL2

# **** NVHPC ****
CCCOM=mpic++
CCFLAGS= -std=c++17 -mp -cuda -fast -Minfo=accel --diag_suppress declared_but_not_referenced,set_but_not_used -fmax-errors=0 -I/usr/local/cuda/include -DSBD_THRUST
SYSLIB= -llapack -lblas
# CCFLAGS+= -DNDEBUG
# CCFLAGS+= -acc=gpu -gpu=maxregcount:64,ptxinfo
CCFLAGS+= -DSBD_THRUST_SAFE_MPI_ALLREDUCE
CCFLAGS+= -DSBD_USE_NVTX # Enable NVTX annotations for profiling (Nsight Systems)
CCFLAGS+= -DSBD_USE_THRUST_NOSYNC # Disable implicit sync in thrust execution to improve concurrency and performance
CCFLAGS+= -DSBD_USE_NCCL # Enable NCCL-based GPU collective communication
SYSLIB+= -lnccl
CCFLAGS+= -DSBD_USE_CUBLAS # Enable cuBLAS for GPU-accelerated linear algebra operations
SYSLIB+= -lcublas
CCFLAGS+= -DSBD_USE_RANK_DISTRIBUTION # Enable configurable MPI rank distribution strategy
CCFLAGS+= -DSBD_USE_BLOCK_RANK_DISTRIBUTION # With SBD_USE_RANK_DISTRIBUTION: use block (contiguous) assignment
# Otherwise: default is cyclic (strided) distribution
# CCFLAGS+= -DSBD_USE_VECTORIZATION # Enable vectorized execution (e.g., multi-element per thread)
CCFLAGS+= -DSBD_REORDER_INDEX_ARRAY # Apply block-based index reordering to improve data locality
CCFLAGS+= -DSBD_USE_32BIT_PARITY # Use 32-bit popcount-based parity to reduce cost and register pressure
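
The `SBD_USE_RANK_DISTRIBUTION` and `SBD_USE_BLOCK_RANK_DISTRIBUTION` flags above choose between a block (contiguous) and a cyclic (strided) assignment. What exactly is being assigned is not visible in this diff, so the following is only a generic sketch of the two schemes for `n` items over `nranks` ranks. The function names `owner_cyclic` and `owner_block` are hypothetical and do not come from SBD.

```cpp
#include <cassert>
#include <cstddef>

// Cyclic (strided) assignment: item i goes to rank i mod nranks.
inline std::size_t owner_cyclic(std::size_t i, std::size_t nranks) {
    assert(nranks > 0);
    return i % nranks;
}

// Block (contiguous) assignment: each rank owns one contiguous range.
// The first (n % nranks) ranks own one extra item so sizes differ by at most 1.
inline std::size_t owner_block(std::size_t i, std::size_t n, std::size_t nranks) {
    assert(nranks > 0 && i < n);
    const std::size_t base = n / nranks;            // minimum items per rank
    const std::size_t extra = n % nranks;           // ranks that get base + 1 items
    const std::size_t cutoff = extra * (base + 1);  // items held by those larger ranks
    if (i < cutoff) {
        return i / (base + 1);
    }
    // Only reached when base > 0, because i < n == cutoff whenever base == 0.
    return extra + (i - cutoff) / base;
}
```

For 10 items on 4 ranks, the block scheme gives ranks 0 and 1 three items each and ranks 2 and 3 two items each, while the cyclic scheme sends items 0, 4, 8 to rank 0. Block assignment tends to favor locality, cyclic assignment tends to favor load balance when cost varies smoothly with the index.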
23 changes: 23 additions & 0 deletions apps/chemistry_tpb_selected_basis_diagonalization/Makefile.nvhpc
@@ -0,0 +1,23 @@
include Configuration.nvhpc
SBD_INCLUDE_DIR=$(SBD_PATH)/include
LIBFLAGS= $(SYSLIB)
MAKEFILES= Makefile.nvhpc Configuration.nvhpc
# header file
HEADER=
# source
SOURCES= main.cc
#objects
OBJECTS=
# compilation
.SUFFIXES:
.SUFFIXES: .h .cc .o
.cc.o: $*.cc
	$(CCCOM) -c $(CCFLAGS) -I$(SBD_INCLUDE_DIR) $<
###############################################################
diag: clean main.o $(OBJECTS)
	$(CCCOM) $(CCFLAGS) -o diag main.o $(OBJECTS) $(LIBFLAGS)

clean:
	rm -f main.o

###############################################################
15 changes: 11 additions & 4 deletions apps/chemistry_tpb_selected_basis_diagonalization/main.cc
@@ -12,6 +12,7 @@
#include "sbd/sbd.h"
#include "mpi.h"

+#include "sbd/framework/nvtx.h"

int main(int argc, char * argv[]) {

@@ -96,8 +97,11 @@ int main(int argc, char * argv[]) {
/**
sample-based diagonalization using fcidump file and adet file
*/
-sbd::tpb::diag(comm,sbd_data,fcifumpfile,adetfile,loadname,savename,
-energy,density,co_adet,co_bdet,one_p_rdm,two_p_rdm);
+{
+  SBD_NVTX_RANGE("diag", __LINE__);
+  sbd::tpb::diag(comm,sbd_data,fcifumpfile,adetfile,loadname,savename,
+  energy,density,co_adet,co_bdet,one_p_rdm,two_p_rdm);
+}

/**
Get L (number of orbitals) and N (number of electrons) from fcidump data for output
@@ -181,8 +185,11 @@ int main(int argc, char * argv[]) {
/**
sample-based diagonalization using data for fcidump, adet, bdet.
*/
-sbd::tpb::diag(comm,sbd_data,fcidump,adet,bdet,loadname,savename,
-energy,density,co_adet,co_bdet,one_p_rdm,two_p_rdm);
+{
+  SBD_NVTX_RANGE("diag");
+  sbd::tpb::diag(comm,sbd_data,fcidump,adet,bdet,loadname,savename,
+  energy,density,co_adet,co_bdet,one_p_rdm,two_p_rdm);
+}

#endif
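
The extra braces placed around each `sbd::tpb::diag` call suggest that `SBD_NVTX_RANGE` opens a profiling range that is closed when the enclosing scope ends. The header `sbd/framework/nvtx.h` is not shown in this diff, so the following is only a sketch of one common way to build such a scoped range, not the SBD implementation. `DemoNvtxRange`, `DEMO_NVTX_RANGE`, and `DEMO_USE_NVTX` are hypothetical names; `nvtxRangePushA` and `nvtxRangePop` are the NVTX C API calls from `<nvtx3/nvToolsExt.h>`. The depth counter exists only so the sketch can be exercised without a profiler.

```cpp
#ifdef DEMO_USE_NVTX
#include <nvtx3/nvToolsExt.h>  // NVTX3 header-only C API
#endif

// Hypothetical RAII guard: pushes a named range on construction
// and pops it on destruction, so the range follows the C++ scope.
class DemoNvtxRange {
public:
    explicit DemoNvtxRange(const char* name) {
#ifdef DEMO_USE_NVTX
        nvtxRangePushA(name);
#else
        (void)name;  // compiled out when profiling is disabled
#endif
        ++depth();
    }
    ~DemoNvtxRange() {
#ifdef DEMO_USE_NVTX
        nvtxRangePop();
#endif
        --depth();
    }
    DemoNvtxRange(const DemoNvtxRange&) = delete;
    DemoNvtxRange& operator=(const DemoNvtxRange&) = delete;

    // Current nesting depth on this thread (illustration/testing aid only).
    static int& depth() {
        static thread_local int d = 0;
        return d;
    }
};

// Give each guard a unique variable name so several ranges can share a scope.
#define DEMO_CONCAT_(a, b) a##b
#define DEMO_CONCAT(a, b) DEMO_CONCAT_(a, b)
#define DEMO_NVTX_RANGE(name) \
    DemoNvtxRange DEMO_CONCAT(demo_nvtx_range_, __LINE__)(name)
```

With a guard of this kind, writing `{ DEMO_NVTX_RANGE("diag"); work(); }` limits the range to exactly the call inside the braces, which matches the shape of the change to `main.cc`.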
