This chapter provides examples of how to implement a multiGPU linear solver.
The example code enables peer-to-peer access to take advantage of NVLINK. The user can check the performance by turning on/off peer-to-peer access.
The example 1 solves linear system by Cholesky factorization (potrf and potrs). It allocates distributed matrix by calling createMat. Then generates the matrix on host memory and copies it to distributed device memory via memcpyH2D.
The example 2 solves linear system using the inverse of a Hermitian positive definite matrix using (potrf and potri). It allocates distributed matrix by calling createMat. Then generates the matrix on host memory and copies it to distributed device memory via memcpyH2D.
All GPUs supported by CUDA Toolkit (https://developer.nvidia.com/cuda-gpus)
Linux Windows
x86_64 ppc64le arm64-sbsa
- cusolverMgPotrf_bufferSize API
- cusolverMgPotrs_bufferSize API
- cusolverMgPotri_bufferSize API
- cusolverMgPotrf API
- cusolverMgPotrs API
- cusolverMgPotri API
- cusolverMgCreateDeviceGrid API
- cusolverMgDeviceSelect API
- A Linux/Windows system with recent NVIDIA drivers.
- CMake version 3.18 minimum
- Minimum CUDA 10.2 toolkit is required.
$ mkdir build
$ cd build
$ cmake .. # -DSHOW_FORMAT=ON
$ make
Make sure that CMake finds expected CUDA Toolkit. If that is not the case you can add argument -DCMAKE_CUDA_COMPILER=/path/to/cuda/bin/nvcc to cmake command.
$ mkdir build
$ cd build
$ cmake -DCMAKE_GENERATOR_PLATFORM=x64 ..
$ Open cusolver_examples.sln project in Visual Studio and build
$ ./cusolver_MgPotrf_example1
Sample example output w/ 1 GPU:
Test 1D Laplacian of order 8
Step 1: Create Mg handle and select devices
There are 1 GPUs
Device 0, NVIDIA TITAN RTX, cc 7.5
Step 2: Enable peer access
Step 3: Allocate host memory A
Step 4: Prepare 1D Laplacian for A and X = ones(N,NRHS)
Step 5: Create RHS for reference solution on host B = A*X
Step 6: Create matrix descriptors for A and D
Step 7: Allocate distributed matrices A and B
Step 8: Prepare data on devices
Step 9: Allocate workspace space
Allocate device workspace, lwork = 1064960
Step 10: Solve A*X = B by POTRF and POTRS
Step 11: Solution vector B
Step 12: Measure residual error |b - A*x|
errors for X[:,1]
|b - A*x|_inf = 2.220446E-16
|x|_inf = 1.000000E+00
|b|_inf = 1.000000E+00
|A|_inf = 4.000000E+00
|b - A*x|/(|A|*|x|+|b|) = 4.440892E-17
errors for X[:,2]
|b - A*x|_inf = 2.220446E-16
|x|_inf = 1.000000E+00
|b|_inf = 1.000000E+00
|A|_inf = 4.000000E+00
|b - A*x|/(|A|*|x|+|b|) = 4.440892E-17
Step 12: Free resources
Sample example output w/ 2 GPU:
Test 1D Laplacian of order 8
Step 1: Create Mg handle and select devices
There are 2 GPUs
Device 0, NVIDIA TITAN RTX, cc 7.5
Device 1, NVIDIA TITAN RTX, cc 7.5
Step 2: Enable peer access
Enable peer access from gpu 0 to gpu 1
Enable peer access from gpu 1 to gpu 0
Step 3: Allocate host memory A
Step 4: Prepare 1D Laplacian for A and X = ones(N,NRHS)
Step 5: Create RHS for reference solution on host B = A*X
Step 6: Create matrix descriptors for A and D
Step 7: Allocate distributed matrices A and B
Step 8: Prepare data on devices
Step 9: Allocate workspace space
Allocate device workspace, lwork = 1064960
Step 10: Solve A*X = B by POTRF and POTRS
Step 11: Solution vector B
Step 12: Measure residual error |b - A*x|
errors for X[:,1]
|b - A*x|_inf = 2.220446E-16
|x|_inf = 1.000000E+00
|b|_inf = 1.000000E+00
|A|_inf = 4.000000E+00
|b - A*x|/(|A|*|x|+|b|) = 4.440892E-17
errors for X[:,2]
|b - A*x|_inf = 2.220446E-16
|x|_inf = 1.000000E+00
|b|_inf = 1.000000E+00
|A|_inf = 4.000000E+00
|b - A*x|/(|A|*|x|+|b|) = 4.440892E-17
Step 12: Free resources
$ ./cusolver_MgPotrf_example2
Sample example output w/ 1 GPU:
Test 1D Laplacian of order 8
Step 1: Create Mg handle and select devices
There are 1 GPUs
Device 0, NVIDIA TITAN RTX, cc 7.5
Step 2: Enable peer access
Step 3: Allocate host memory A
Step 4: Prepare 1D Laplacian for A and Xref = ones(N,NRHS)
Step 5: Create RHS for reference solution on host B = A*X
Step 6: Create matrix descriptors for A and D
Step 7: Allocate distributed matrices A and B
Step 8: Prepare data on devices
Step 9: Allocate workspace space
Allocate device workspace, lwork = 1067008
Step 10: Solve A*X = B by POTRF and POTRI
Step 11: Gather INV(A) from devices to host
Step 12: Solve linear system B := inv(A) * B
Step 13: Measure residual error |Xref - Xans|
errors for X[:,1]
|b - A*x|_inf = 4.440892E-16
|Xref|_inf = 1.000000E+00
|Xans|_inf = 1.000000E+00
|A|_inf = 4.000000E+00
|b - A*x|/(|A|*|x|+|b|) = 8.881784E-17
errors for X[:,2]
|b - A*x|_inf = 4.440892E-16
|Xref|_inf = 1.000000E+00
|Xans|_inf = 1.000000E+00
|A|_inf = 4.000000E+00
|b - A*x|/(|A|*|x|+|b|) = 8.881784E-17
Step 14: Free resources
Sample example output w/ 2 GPU:
Test 1D Laplacian of order 8
Step 1: Create Mg handle and select devices
There are 2 GPUs
Device 0, NVIDIA TITAN RTX, cc 7.5
Device 1, NVIDIA TITAN RTX, cc 7.5
Step 2: Enable peer access
Enable peer access from gpu 0 to gpu 1
Enable peer access from gpu 1 to gpu 0
Step 3: Allocate host memory A
Step 4: Prepare 1D Laplacian for A and Xref = ones(N,NRHS)
Step 5: Create RHS for reference solution on host B = A*X
Step 6: Create matrix descriptors for A and D
Step 7: Allocate distributed matrices A and B
Step 8: Prepare data on devices
Step 9: Allocate workspace space
Allocate device workspace, lwork = 1067008
Step 10: Solve A*X = B by POTRF and POTRI
Step 11: Gather INV(A) from devices to host
Step 12: Solve linear system B := inv(A) * B
Step 13: Measure residual error |Xref - Xans|
errors for X[:,1]
|b - A*x|_inf = 4.440892E-16
|Xref|_inf = 1.000000E+00
|Xans|_inf = 1.000000E+00
|A|_inf = 4.000000E+00
|b - A*x|/(|A|*|x|+|b|) = 8.881784E-17
errors for X[:,2]
|b - A*x|_inf = 4.440892E-16
|Xref|_inf = 1.000000E+00
|Xans|_inf = 1.000000E+00
|A|_inf = 4.000000E+00
|b - A*x|/(|A|*|x|+|b|) = 8.881784E-17
Step 14: Free resources