
2022-07-07-rev703 - Corresponds to publication Dong et al, 2024


@MiCurry MiCurry released this 09 Feb 20:13
· 156 commits to main since this release

Note: This release is on the main (SVN stable-) branch and is not intended to be a robust release. Even though Release 2023-11-19-rev755 comes after this release, it was based on the stable branch and does not include these changes.

Introducing GPU solvers. Please refer to https://github.com/dong-hao/ModEM-GPU for bug updates based on this revision. Please follow the main branch for further development of this functionality.

Dong, H., Sun, K., Egbert, G. D., Kelbert, A., & Meqbel, N. (2024). Hybrid CPU-GPU solution to regularized divergence-free curl-curl equations for electromagnetic inversion problems. Computers & Geosciences, 184, 105518, Elsevier. https://doi.org/10.1016/j.cageo.2024.105518

-rev630 (89c3215):

introducing the new GPU solvers -

This (hopefully) will not affect the behavior of the CPU solvers. For now I have
only implemented the GPU solver in the SP2 version - which is the most efficient
CPU version so far.

Depending on your hardware, the GPU vs CPU speed-up can be anywhere from a
few times (for old/weak GPUs) to tens of times (for professional ones).
The code has been tested on various computers, from my old laptop (bought in
2016) to a brand-new multi-GPU workstation, with GCC+Gfortran 7/9 and CUDA 10/11.
Also note that this apparently only works for NVIDIA cards, as the
implementation is through CUDA. Other "universal" interfaces like Kokkos or
OpenCL may be preferred for other GPUs - however, those interfaces do not
yet provide the ability to implement kernel-level improvements.

Those who want to use the code are welcome to try - although one should bear
in mind that the GPU libs and CUDA are rather user-hostile to set up. See the
Makefile.gpu file for a basic example of how to compile the code. You have
been warned...

3D_MT/FWD_SP2/solver.f90:

See the cuBiCG/cuBiCGmix subroutines for the details of the new solvers.
The GPU version of the BiCG solver (should) yield results consistent with the
CPU BiCG. The default GPU solver is now the double precision version.
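
For readers unfamiliar with this solver family, here is a minimal pure-Python sketch of an unpreconditioned stabilized BiCG (BiCGSTAB) iteration on a small complex system. It is not the cuBiCG code: the choice of the stabilized variant, the function names and the 3x3 test matrix are all assumptions made for illustration, and the real solver works on sparse matrices with preconditioning.

```python
def matvec(A, x):
    # Dense matrix-vector product; the real code uses sparse storage.
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]

def dot(x, y):
    # Conjugated inner product <x, y> = sum(conj(x_i) * y_i).
    return sum(xi.conjugate() * yi for xi, yi in zip(x, y))

def norm(x):
    return abs(dot(x, x)) ** 0.5

def bicgstab(A, b, tol=1e-10, max_iter=100):
    """Textbook BiCGSTAB with a zero initial guess (illustrative only)."""
    n = len(b)
    x = [0j] * n
    r = [complex(bi) for bi in b]      # r0 = b - A*x0 with x0 = 0
    r_hat = list(r)                    # fixed shadow residual
    rho = alpha = omega = 1 + 0j
    v = [0j] * n
    p = [0j] * n
    b_norm = norm(r)
    if b_norm == 0.0:
        return x, 0
    for it in range(1, max_iter + 1):
        rho_new = dot(r_hat, r)
        beta = (rho_new / rho) * (alpha / omega)
        p = [ri + beta * (pi - omega * vi) for ri, pi, vi in zip(r, p, v)]
        v = matvec(A, p)
        alpha = rho_new / dot(r_hat, v)
        s = [ri - alpha * vi for ri, vi in zip(r, v)]
        if norm(s) <= tol * b_norm:    # converged after the half step
            x = [xi + alpha * pi for xi, pi in zip(x, p)]
            return x, it
        t = matvec(A, s)
        omega = dot(t, s) / dot(t, t)
        x = [xi + alpha * pi + omega * si for xi, pi, si in zip(x, p, s)]
        r = [si - omega * ti for si, ti in zip(s, t)]
        rho = rho_new
        if norm(r) <= tol * b_norm:
            return x, it
    return x, max_iter

# Hypothetical diagonally dominant complex test system.
A = [[4 + 1j, 1, 0], [1, 5 + 2j, 1], [0, 1, 6 + 1j]]
b = [1, 2, 3]
x, iters = bicgstab(A, b)
residual = norm([ax - bi for ax, bi in zip(matvec(A, x), b)])
```

Each iteration costs two matrix-vector products plus a handful of dot products and vector updates, and these are the operations a GPU implementation would offload.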

3D_MT/FWD_SP2/EMsolve3D.f90:

added a method to enable using multiple CPUs with one GPU - to mitigate the
impact of the overheads from the CPU side. This is important for an efficient
hybrid CPU-GPU parallel infrastructure.

3D_MT/FWD_SP2/cudaFortMap.f90:

added this file for the translation of CUDA-C interfaces to Fortran. As we can
only write kernels in C, this is kind of inevitable.

3D_MT/FWD_SP2/kernel_c.cu:

added all custom CUDA-C kernel codes here. I never imagined myself writing C
again after all these years, hmmmmm.

-rev648 (b39f69c):

INV/NLCG.f90 INV/LBFGS.f90
cleaned up the debug and stub code left over from my experiments with the
linesearch schemes and the LBFGS solver. Also removed LBFGSsolver2, which was used for inversions in smooth model space.

Note that the default linesearch routine is "cubic" for NLCG and "wolfe2" for
LBFGS. The wolfe2 routine uses only one forward and one adjoint calculation per
iteration, so it should need about 1/3 less computation than the "wolfe" one (2 fwds and 1 trn).

One problem with wolfe2 is that the inexact line search (not enough descent)
may cause the iteration to finish early (or lambda to be updated early).
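
To make the trade-off concrete, the sketch below checks the standard strong Wolfe conditions for a trial step and works out the per-iteration cost ratio quoted above. The function name, the constants `c1` and `c2`, and the one-dimensional test function are assumptions for illustration; they are not taken from the "wolfe" or "wolfe2" routines.

```python
def strong_wolfe(f0, slope0, f_alpha, slope_alpha, alpha, c1=1e-4, c2=0.9):
    """Return (sufficient_decrease, curvature) for a trial step length alpha.

    f0, slope0           : objective and directional derivative at the start
    f_alpha, slope_alpha : the same quantities at the trial point
    c1, c2               : textbook constants (assumed, not the ModEM values)
    """
    sufficient_decrease = f_alpha <= f0 + c1 * alpha * slope0
    curvature = abs(slope_alpha) <= c2 * abs(slope0)
    return sufficient_decrease, curvature

# Toy problem: f(x) = x**2, starting at x = 1 along the direction p = -2.
def f(x):
    return x * x

def grad(x):
    return 2.0 * x

x0, p = 1.0, -2.0
good = strong_wolfe(f(x0), grad(x0) * p,
                    f(x0 + 0.5 * p), grad(x0 + 0.5 * p) * p, 0.5)
short = strong_wolfe(f(x0), grad(x0) * p,
                     f(x0 + 0.01 * p), grad(x0 + 0.01 * p) * p, 0.01)

# Solves per iteration: "wolfe" = 2 forward + 1 adjoint, "wolfe2" = 1 + 1.
cost_wolfe, cost_wolfe2 = 2 + 1, 1 + 1
saving = 1 - cost_wolfe2 / cost_wolfe   # fraction of solves saved, about 1/3
```

Here `good` passes both tests, while the very short step in `short` decreases the objective but fails the curvature test. That is the "not enough descent" situation an inexact search can accept if it evaluates fewer conditions.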

If you are interested, you can also try the PR+ restart criterion
for NLCG (comment out the two PR lines and uncomment the PR+ ones).
In my experience, it is slightly faster than the
default "PR" criterion for NLCG.
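
For reference, the textbook Polak-Ribiere (PR) and PR+ formulas differ only in a clamp at zero. The sketch below uses plain Python lists and hypothetical function names; the NLCG code may apply a preconditioner or a model covariance inside the inner products, which is not shown here.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def beta_pr(g_new, g_old):
    # Polak-Ribiere: beta = g_new . (g_new - g_old) / (g_old . g_old)
    y = [gn - go for gn, go in zip(g_new, g_old)]
    return dot(g_new, y) / dot(g_old, g_old)

def beta_pr_plus(g_new, g_old):
    # PR+: a negative beta is clamped to zero, which restarts the search
    # along the steepest-descent direction.
    return max(0.0, beta_pr(g_new, g_old))

g_old = [1.0, 0.0]
beta_a = beta_pr([0.5, 0.0], g_old)        # negative: gradient shrank along g_old
beta_b = beta_pr_plus([0.5, 0.0], g_old)   # clamped to zero, i.e. a restart
beta_c = beta_pr_plus([0.0, 2.0], g_old)   # positive values are left unchanged
```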

Mod3DMT.f90/Mod2DMT.f90
updated the corresponding interfaces (from LBFGSsolver2 to LBFGSsolver)

-rev675 (d127dae):

A couple of important updates on the two-layered parallel structure (as discussed in the GPU paper).

MPI/Declaration_MPI.f90, Main_MPI.f90:
The most important feature is a topology-based parallel paradigm. We can now find out the number of GPU devices on each physical node, which makes it easier to use multiple GPUs across multiple machines.
The other feature is a node-based parallel method, controlled by a master switch (para_method) in Declaration_MPI.f90. The basic idea is to use the bandwidth of one entire server node to, well, accelerate one single FWD/ADJ task. It allows one to utilize tens of computational servers at once (with the PETSc routines, of course). That’s probably the best way to get maximum throughput if one doesn’t have any GPUs at hand.
That can, of course, singlehandedly dry out your CPU-hour deposit - but I guess it is worth it if you can get a total bandwidth of tens of Terabytes and get the job done faster. But note that this is turned off by default (set para_method = 1 and re-compile to try it).
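
The topology idea can be sketched as follows: group the ranks by the physical node they run on, give each rank a node-local index, and map that index onto the GPUs found on the node, so that several CPU ranks can share one device. This is only an illustration with hypothetical names (`assign_devices`, the host names and the GPU counts); the Fortran code would obtain the host and device information through MPI and the CUDA runtime instead of taking it as plain arguments.

```python
from collections import defaultdict

def assign_devices(rank_hosts, gpus_per_host):
    """Map each global rank to (host, local_rank, device_id).

    rank_hosts    : list where index = global rank and value = host name
    gpus_per_host : dict of host name -> number of GPUs on that host
    device_id is None on hosts without a GPU (CPU-only ranks).
    """
    next_local = defaultdict(int)
    mapping = {}
    for rank, host in enumerate(rank_hosts):
        local_rank = next_local[host]
        next_local[host] += 1
        n_gpu = gpus_per_host.get(host, 0)
        # Round-robin: with more ranks than GPUs, several ranks share a device.
        device = local_rank % n_gpu if n_gpu > 0 else None
        mapping[rank] = (host, local_rank, device)
    return mapping

# Hypothetical layout: three ranks on a 2-GPU node, two on a CPU-only node.
layout = assign_devices(["n1", "n1", "n1", "n2", "n2"], {"n1": 2, "n2": 0})
```

In this layout ranks 0 and 2 share device 0 on `n1`, which is the "multiple CPUs with one GPU" case, while the ranks on `n2` stay on the CPU.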

3D_MT/FWD_SPETSc2/ EMsolve3D.f90, modelOperator3D.f90:
I also tried to make the PETSc version compatible with the new configuration format. Not sure if it works correctly, though.

Makefiles:
updated the template makefiles (probably only used by myself for quickly switching branches).

-rev699 (7ec8bc4):

3D_MT/FWD_SP2/EMsolve3D.f90 solver.f90
fixed a bug when using solvers other than BiCG with GPU - as there weren't any other GPU solvers at all (TFQMR should not be very hard to implement, as we already have the CPU version)...
The code now gives a warning if QMR or TFQMR is selected, and falls back to the CPU solvers. Also fixed a silly typo that caused the Fortran-C interoperability to fail.
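
The fallback behavior described here can be pictured with a small dispatch function. The names below (`GPU_SOLVERS`, `select_solver`) and the warning text are hypothetical and only mirror the logic; they are not identifiers from EMsolve3D.f90 or solver.f90.

```python
import warnings

# Assumption for this sketch: BiCG is the only solver with a GPU version.
GPU_SOLVERS = {"BICG"}

def select_solver(name, use_gpu):
    """Return (solver_name, backend), where backend is 'gpu' or 'cpu'."""
    name = name.upper()
    if use_gpu and name not in GPU_SOLVERS:
        warnings.warn(f"{name} has no GPU version, falling back to the CPU solver")
        return name, "cpu"
    return name, ("gpu" if use_gpu else "cpu")

choice_bicg = select_solver("BiCG", use_gpu=True)    # runs on the GPU
choice_tfqmr = select_solver("TFQMR", use_gpu=True)  # warns, then uses the CPU
choice_qmr = select_solver("QMR", use_gpu=False)     # plain CPU run, no warning
```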

TODO:
there is still no means to configure the GPU version with fmkmf.perl, as the code now involves a CUDA C++ compiler (NVCC) - this needs to be done by hand for now - i.e. the $CC and $C_FLAG thingy. See Makefile.gpu for an example.

-rev703 (0b9626d):

a quick update to set the default GPU solver to use a
deterministic (but slightly slower) algorithm
3D_MT/FWD_SP2/solver.f90
use CUSPARSE_SPMV_CSR_ALG2 to replace ALG1
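
Why the choice of SpMV algorithm affects reproducibility: floating-point addition is not associative, so an algorithm that accumulates the products of a row in a run-dependent order can change the last bits of the result, whereas a fixed order gives bitwise identical output. The sketch below shows the effect in plain Python together with a sequential CSR matrix-vector product. It illustrates the principle only and says nothing about how cuSPARSE implements either algorithm.

```python
# Same three numbers, two summation orders, two different doubles.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)

def csr_spmv(indptr, indices, data, x):
    """y = A @ x for a CSR matrix, accumulating each row in a fixed order."""
    y = []
    for row in range(len(indptr) - 1):
        acc = 0.0
        for k in range(indptr[row], indptr[row + 1]):
            acc += data[k] * x[indices[k]]
        y.append(acc)
    return y

# CSR form of the matrix [[1, 2], [0, 3]].
indptr, indices, data = [0, 2, 3], [0, 1, 1], [1.0, 2.0, 3.0]
y1 = csr_spmv(indptr, indices, data, [1.0, 1.0])
y2 = csr_spmv(indptr, indices, data, [1.0, 1.0])
```

Repeated calls with the same input (`y1` and `y2`) agree exactly because the accumulation order never changes, which is the property a deterministic algorithm trades some speed for.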