# GPU-Accelerated Linear Solving in Julia

LinearSolve.jl provides two ways to GPU accelerate linear solves:

* Offloading: offloading takes a CPU-based problem and automatically transforms it into a
  GPU-based problem in the background, and returns the solution on CPU. Thus using
  offloading requires no change on the part of the user other than to choose an offloading
  solver.
* Array type interface: the array type interface requires that the user defines the
  `LinearProblem` using an `AbstractGPUArray` type and chooses an appropriate solver
  (or uses the default solver). The solution will then be returned as a GPU array type.

The offloading approach has the advantage of being simpler and requiring no changes to
existing CPU code, at the cost of more overhead. In the following sections we will
demonstrate how to use each of these approaches.

!!! warning

    GPUs are not always faster! Your matrices need to be sufficiently large in order for
    GPU accelerations to actually be faster. For offloading it's around 1,000 x 1,000
    matrices, and for the array type interface it's around 100 x 100. For sparse matrices,
    it is highly dependent on the sparsity pattern and the amount of fill-in.
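
As a rough check of where that crossover lies on your own hardware, you can time a CPU solve
against an offloaded solve. The following is only a sketch (it assumes CUDA.jl and
BenchmarkTools.jl are installed); the break-even size depends on your particular CPU and GPU.

```julia
# Sketch: compare a CPU solve against a GPU-offloaded solve at a given size
using LinearSolve, CUDA, BenchmarkTools

n = 1000
A = rand(n, n)
b = rand(n)
prob = LinearProblem(A, b)

@btime solve($prob)                              # default CPU factorization
@btime solve($prob, CudaOffloadFactorization())  # offloads the factorization to the GPU
```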

## GPU-Offloading

GPU offloading is simple to use, as it only requires changing the solver algorithm. Take the
example from the start of the documentation:

```julia
using LinearSolve

A = rand(4, 4)
b = rand(4)
prob = LinearProblem(A, b)
sol = solve(prob)
sol.u
```

This computation can be moved to the GPU by the following:

```julia
using CUDA # Add the GPU library
sol = solve(prob, CudaOffloadFactorization())
sol.u
```

## GPUArray Interface

For more manual control over the factorization setup, you can use the
[GPUArray interface](https://juliagpu.github.io/GPUArrays.jl/dev/), the most common
instantiation being [CuArray for CUDA-based arrays on NVIDIA GPUs](https://cuda.juliagpu.org/stable/usage/array/).
To use this, we simply send the matrix `A` and the vector `b` over to the GPU and solve:

```julia
using LinearSolve, CUDA

A = rand(4, 4) |> cu
b = rand(4) |> cu
prob = LinearProblem(A, b)
sol = solve(prob)
sol.u
```

```
4-element CuArray{Float32, 1, CUDA.DeviceMemory}:
 -27.02665
 16.338171
 -77.650116
 106.335686
```

Notice that the solution is a `CuArray`, and thus one must use `Array(sol.u)` if you wish
to return it to the CPU. This setup does no automated memory transfers and will thus only
move data back to the CPU on command.
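
For example, a copy of the solution can be brought back to host memory like this:

```julia
# Copy the GPU-resident solution into a standard CPU Array
u_cpu = Array(sol.u)
```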

!!! warning

    Many GPU functionalities, such as `CUDA.cu`, have a built-in preference for `Float32`.
    Generally it is much faster to use 32-bit floating point operations on GPU than 64-bit
    operations, and thus this is generally the right choice when targeting such platforms.
    However, this change in numerical precision needs to be accounted for in your mathematics,
    as it could lead to instabilities. To disable this, use a constructor that is more
    specific about the bitsize, such as `CuArray{Float64}(A)`. Additionally, preferring more
    stable factorization methods, such as `QRFactorization()`, can improve the numerics in
    such cases.

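
If your problem genuinely needs 64-bit precision on the device, a minimal sketch (assuming a
CUDA-capable GPU) of constructing the arrays explicitly as `Float64` is:

```julia
using LinearSolve, CUDA

A = rand(4, 4)
b = rand(4)

# Request Float64 storage explicitly instead of cu's Float32 default
A_gpu = CuArray{Float64}(A)
b_gpu = CuArray{Float64}(b)

prob = LinearProblem(A_gpu, b_gpu)
sol = solve(prob)
```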
As with other use cases, you can also choose the solver, for example:

```julia
sol = solve(prob, QRFactorization())
```

## Sparse Matrices on GPUs

Currently, sparse matrix computations on GPUs are only supported for CUDA. This is done using
the `CUDA.CUSPARSE` sublibrary.

```julia
using LinearSolve, LinearAlgebra, SparseArrays, CUDA, CUDA.CUSPARSE
T = Float32
n = 100
A_cpu = sprand(T, n, n, 0.05) + I
x_cpu = zeros(T, n)
b_cpu = rand(T, n)

# Move the sparse matrix and the right-hand side to the GPU
A_gpu_csr = CuSparseMatrixCSR(A_cpu)
b_gpu = CuVector(b_cpu)

prob = LinearProblem(A_gpu_csr, b_gpu)
```

In order to solve such problems using a direct method, you must add
[CUDSS.jl](https://github.com/exanauts/CUDSS.jl). This looks like:

```julia
using CUDSS # provides the NVIDIA cuDSS bindings used for the sparse LU on GPU
sol = solve(prob, LUFactorization())
```

!!! note

    For now, CUDSS only supports `CuSparseMatrixCSR` type matrices.

Note that `KrylovJL` methods also work with sparse GPU arrays:

```julia
sol = solve(prob, KrylovJL_GMRES())
```

Note that CUSPARSE also has some GPU-based preconditioners, such as a built-in `ilu`. For
example:

```julia
sol = solve(prob, KrylovJL_GMRES(precs = (A, p) -> (CUDA.CUSPARSE.ilu02!(A, 'O'), I)))
```

However, right now CUSPARSE is missing the right `ldiv!` implementation for this to work
in general. See https://github.com/SciML/LinearSolve.jl/issues/341 for details.