L1 Gauss Seidel preconditioner #191


Draft pull request: wants to merge 55 commits into base: main. Changes shown are from 10 commits.
2382a8f
init l1 smoother
Abdelrahman912 Feb 28, 2025
e8991ba
minor fix
Abdelrahman912 Feb 28, 2025
a127bc0
init cuda prec setup
Abdelrahman912 Mar 23, 2025
d216e34
init working cuda
Abdelrahman912 Mar 24, 2025
086ed78
fix partition limit indices
Abdelrahman912 Mar 24, 2025
99b1ab6
add comment
Abdelrahman912 Mar 24, 2025
64a247c
minor change
Abdelrahman912 Mar 24, 2025
343c99d
minor adjustment for csc
Abdelrahman912 Mar 25, 2025
ee66f73
add cuda csr
Abdelrahman912 Mar 25, 2025
fbb3338
check symmetry for csc
Abdelrahman912 Mar 26, 2025
2b2dc17
Merge branch 'main' into l1-gs-smoother
Abdelrahman912 Mar 26, 2025
5e1203a
rm unnecessary code
Abdelrahman912 Mar 26, 2025
bc1cec3
Merge branch 'main' into l1-gs-smoother
Abdelrahman912 Mar 26, 2025
c8cc291
add cpu version
Abdelrahman912 Mar 27, 2025
457c110
Merge branch 'add-multi-threading-l1-prec' into l1-gs-smoother
Abdelrahman912 Mar 27, 2025
7f17845
Merge branch 'main' into l1-gs-smoother
Abdelrahman912 Mar 28, 2025
7c9b474
Merge branch 'main' into l1-gs-smoother
Abdelrahman912 Apr 1, 2025
42026cd
init ka, working but buggy.
Abdelrahman912 Apr 1, 2025
1b3ce6f
Merge branch 'main' into l1-gs-smoother
Abdelrahman912 Apr 1, 2025
8a5675b
Merge branch 'ka-porting' into l1-gs-smoother
Abdelrahman912 Apr 1, 2025
476928b
Merge branch 'main' into l1-gs-smoother
Abdelrahman912 Apr 2, 2025
8bcca25
fix ka buggy code
Abdelrahman912 Apr 2, 2025
77f7148
add tests
Abdelrahman912 Apr 2, 2025
aecbf67
minor fix
Abdelrahman912 Apr 2, 2025
1aa8986
update manifest
Abdelrahman912 Apr 2, 2025
552f8b8
Merge branch 'main' into l1-gs-smoother
Abdelrahman912 Apr 2, 2025
dc8d4e4
merge cpu and gpu
Abdelrahman912 Apr 8, 2025
24b72ab
Merge branch 'main' into l1-gs-smoother
Abdelrahman912 Apr 8, 2025
a87c66f
minor fix
Abdelrahman912 Apr 8, 2025
5778fe9
add Preconditioners submodule
Abdelrahman912 Apr 9, 2025
b104cfb
remove unnecessary module reference
Abdelrahman912 Apr 9, 2025
0e93f05
add cpu symmetric test
Abdelrahman912 Apr 9, 2025
51fa896
add test path
Abdelrahman912 Apr 9, 2025
1f56bad
minor fix
Abdelrahman912 Apr 9, 2025
226f55a
set nparts to be ncores
Abdelrahman912 Apr 9, 2025
292aacc
precompute blocks
Abdelrahman912 Apr 10, 2025
36f5754
separate CPU GPU tests
Abdelrahman912 Apr 10, 2025
33f1de7
fix ci
Abdelrahman912 Apr 10, 2025
3b52869
minor fix
Abdelrahman912 Apr 10, 2025
a056a67
add symmetric test
Abdelrahman912 Apr 10, 2025
bf2cc96
rm dead code
Abdelrahman912 Apr 10, 2025
a84ab45
comment out adapt
Abdelrahman912 Apr 10, 2025
0794dce
rm direct solver
Abdelrahman912 Apr 10, 2025
2532179
add doc string
Abdelrahman912 Apr 10, 2025
8203cac
add gpu test examples
Abdelrahman912 Apr 11, 2025
7ef9e15
minor fix
Abdelrahman912 Apr 14, 2025
869d3db
elementwise operations refinement
Abdelrahman912 Apr 14, 2025
9609d59
add reference
Abdelrahman912 Apr 14, 2025
8e62678
add block partitioning to doc string + some comments for (CSC/CSR)Format
Abdelrahman912 Apr 14, 2025
4a6454c
rm piratical code (only those which were merged into CUDA.jl) + add w…
Abdelrahman912 Apr 18, 2025
cce6547
rm dead code
Abdelrahman912 Apr 22, 2025
2ca65a6
init gs
Abdelrahman912 Apr 29, 2025
1e01857
init forward_sweep
Abdelrahman912 May 3, 2025
01faea3
minor fixes
Abdelrahman912 May 3, 2025
df65f07
minor fixes (buggy test )
Abdelrahman912 May 3, 2025
12 changes: 10 additions & 2 deletions ext/CuThunderboltExt.jl
@@ -1,6 +1,7 @@
module CuThunderboltExt

using Thunderbolt
using LinearSolve

import CUDA:
CUDA, CuArray, CuVector, CUSPARSE,blockDim,blockIdx,gridDim,threadIdx,
@@ -10,13 +11,14 @@ import CUDA:
import Thunderbolt:
UnPack.@unpack,
SimpleMesh,
SparseMatrixCSR, SparseMatrixCSC,
SparseMatrixCSR, SparseMatrixCSC, AbstractSparseMatrix,
Review comment (Collaborator): Can we import these types here from their respective packages instead of Thunderbolt? We might hit weird bugs otherwise.

AbstractSemidiscreteFunction, AbstractPointwiseFunction, solution_size,
AbstractPointwiseSolverCache,assemble_element!,
LinearOperator,QuadratureRuleCollection,
AnalyticalCoefficientElementCache,AnalyticalCoefficientCache,CartesianCoordinateSystemCache,
setup_element_cache,update_operator!,init_linear_operator,FieldCoefficientCache, CudaAssemblyStrategy, floattype,inttype,
convert_vec_to_concrete,deep_adapt,AbstractElementAssembly,GeneralLinearOperator
convert_vec_to_concrete,deep_adapt,AbstractElementAssembly,GeneralLinearOperator,
CudaL1PrecBuilder,build_l1prec,AbstractPartitioning,L1Preconditioner, ldiv!

import Thunderbolt.FerriteUtils:
StaticInterpolationValues,StaticCellValues, allocate_device_mem,
@@ -80,9 +82,15 @@ function Thunderbolt.adapt_vector_type(::Type{<:CuVector}, v::VT) where {VT <: V
return CuVector(v)
end

CUDA.CUSPARSE.CuSparseMatrixCSR{T}(Mat::SparseMatrixCSR) where {T} =
CUDA.CUSPARSE.CuSparseMatrixCSR{T}(CuVector{Cint}(Mat.rowptr), CuVector{Cint}(Mat.colval),
CuVector{T}(Mat.nzval), size(Mat))


include("cuda/cuda_operator.jl")
include("cuda/cuda_memalloc.jl")
include("cuda/cuda_adapt.jl")
include("cuda/cuda_iterator.jl")
include("cuda/cuda_preconditioner.jl")

end
219 changes: 219 additions & 0 deletions ext/cuda/cuda_preconditioner.jl
@@ -0,0 +1,219 @@
#########################################
## CUDA L1 Gauss Seidel Preconditioner ##
#########################################

struct CudaPartitioning{Ti} <: AbstractPartitioning
threads::Ti # number of diagonals per partition
blocks::Ti # number of partitions
size_A::Ti
end

function Thunderbolt.build_l1prec(::CudaL1PrecBuilder, A::AbstractSparseMatrix;
n_threads::Union{Integer,Nothing}=nothing, n_blocks::Union{Integer,Nothing}=nothing)
if CUDA.functional()
# Raise error if invalid thread or block count is provided
if !isnothing(n_threads) && n_threads == 0
error("n_threads must be greater than zero")
end
if !isnothing(n_blocks) && n_blocks == 0
error("n_blocks must be greater than zero")
end
return _build_cuda_l1prec(A, n_threads, n_blocks)
else
error("CUDA is not functional, please check your GPU driver and CUDA installation")
end
end

(builder::CudaL1PrecBuilder)(A::AbstractSparseMatrix;
n_threads::Union{Integer,Nothing}=nothing, n_blocks::Union{Integer,Nothing}=nothing) = build_l1prec(builder, A; n_threads=n_threads, n_blocks=n_blocks)

function _build_cuda_l1prec(A::AbstractSparseMatrix, n_threads::Union{Integer,Nothing}, n_blocks::Union{Integer,Nothing})
# Determine threads and blocks if not provided
# blocks -> number of partitions
# threads -> number of diagonals per partition
size_A = convert(Int32,size(A, 1))
threads = isnothing(n_threads) ? convert(Int32, min(size_A, 256)) : convert(Int32, n_threads)
blocks = isnothing(n_blocks) ? _calculate_nblocks(threads, size_A) : convert(Int32, n_blocks)
partitioning = CudaPartitioning(threads, blocks, size_A)
cuda_A = _cuda_A(A)
return L1Preconditioner(partitioning, cuda_A)
end

_cuda_A(A::SparseMatrixCSC) = CUSPARSE.CuSparseMatrixCSC(A)
_cuda_A(A::SparseMatrixCSR) = CUSPARSE.CuSparseMatrixCSR(A)
_cuda_A(A::CUSPARSE.CuSparseMatrixCSC) = A
_cuda_A(A::CUSPARSE.CuSparseMatrixCSR) = A
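`_calculate_nblocks` is called in `_build_cuda_l1prec` above but is not part of this hunk; a minimal sketch consistent with its call site (threads per block and matrix size in, an `Int32` block count out) might be:

```julia
# Hypothetical sketch, not necessarily the PR's actual implementation:
# ceil-divide the number of diagonals by the threads per block so that
# every diagonal of A is covered by exactly one thread.
_calculate_nblocks(threads::Integer, size_A::Integer) = convert(Int32, cld(size_A, threads))
```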

# TODO: should x & b be CuArrays? or leave them as AbstractVector?
function LinearSolve.ldiv!(y::VectorType, P::L1Preconditioner{CudaPartitioning{Ti}}, x::VectorType) where {Ti, VectorType <: AbstractVector}
# x: residual
# y: preconditioned residual
y .= x #works either way, whether x is CuArray or Vector
_ldiv!(y, P)
end

function _ldiv!(y::CuVector , P::L1Preconditioner{CudaPartitioning{Ti}}) where {Ti}
@unpack partitioning, A = P
@unpack threads, blocks, size_A = partitioning
issym = isapprox(A, A',rtol=1e-12)
CUDA.@sync CUDA.@cuda threads=threads blocks=blocks _l1prec_kernel!(y, A,issym)
return nothing
end

function _ldiv!(y::Vector , P::L1Preconditioner{CudaPartitioning{Ti}}) where {Ti}
@unpack partitioning, A = P
@unpack threads, blocks, size_A = partitioning
cy = y |> cu
issym = isapprox(A, A',rtol=1e-12)
CUDA.@sync CUDA.@cuda threads=threads blocks=blocks _l1prec_kernel!(cy, A,issym)
copyto!(y, cy)
return nothing
end

abstract type AbstractMatrixSymmetry end

struct SymmetricMatrix <: AbstractMatrixSymmetry end # important for the case of CSC format
struct NonSymmetricMatrix <: AbstractMatrixSymmetry end

# TODO: consider creating unified iterator for both CPU and GPU.
struct DeviceDiagonalIterator{MatrixType, MatrixSymmetry <: AbstractMatrixSymmetry}
A::MatrixType
end

struct DeviceDiagonalCache{Ti,Tv}
k::Ti # partition index
idx::Ti # diagonal index
b::Tv # partition diagonal value
d::Tv # off-partition absolute sum
end

DiagonalIterator(::Type{SymT}, A::MatrixType) where {SymT <: AbstractMatrixSymmetry, MatrixType} =
DeviceDiagonalIterator{MatrixType, SymT}(A)

function Base.iterate(iterator::DeviceDiagonalIterator)
idx = (blockIdx().x - Int32(1)) * blockDim().x + threadIdx().x # diagonal index
idx <= size(iterator.A, 1) || return nothing
k = blockIdx().x # partition index
return (_makecache(iterator,idx,k), (idx,k))
end

function Base.iterate(iterator::DeviceDiagonalIterator, state)
n_blocks = gridDim().x
n_threads = blockDim().x
idx,k = state
k += n_blocks # partition index
stride = n_blocks * n_threads
idx = idx + stride # diagonal index
idx <= size(iterator.A, 1) || return nothing
return (_makecache(iterator,idx,k), (idx,k))
end

function _makecache(iterator::DeviceDiagonalIterator{CUSPARSE.CuSparseDeviceMatrixCSC{Tv,Ti,1},NonSymmetricMatrix}, idx,k) where {Tv,Ti}
#Ωⁱ := {j ∈ Ωₖ : i ∈ Ωₖ}
#Ωⁱₒ := {j ∉ Ωₖ : i ∈ Ωₖ} off-partition column values
# bₖᵢ := Aᵢᵢ
# dₖᵢ := ∑_{j ∈ Ωⁱₒ} |Aᵢⱼ|
n_threads = blockDim().x
@unpack A = iterator
part_start_idx = (k - Int32(1)) * n_threads + Int32(1)
part_end_idx = min(part_start_idx + n_threads - Int32(1), size(A, 2))

b = zero(eltype(A))
d = zero(eltype(A))

# specific to CSC format
for col in 1:size(A, 2)
col_start = A.colPtr[col]
col_end = A.colPtr[col+1] - 1

for i in col_start:col_end
row = A.rowVal[i]
if row == idx
v = A.nzVal[i]

if part_start_idx > col || col > part_end_idx
d += abs(v)
end

if col == idx
b = v
end
end
end
end

return DeviceDiagonalCache(k, idx,b, d)
end

function _makecache(iterator::DeviceDiagonalIterator{CUSPARSE.CuSparseDeviceMatrixCSC{Tv,Ti,1},SymmetricMatrix}, idx,k) where {Tv,Ti}
#Ωⁱ := {j ∈ Ωₖ : i ∈ Ωₖ}
#Ωⁱₒ := {j ∉ Ωₖ : i ∈ Ωₖ} off-partition column values
# bₖᵢ := Aᵢᵢ
# dₖᵢ := ∑_{j ∈ Ωⁱₒ} |Aᵢⱼ|
n_threads = blockDim().x
@unpack A = iterator
part_start_idx = (k - Int32(1)) * n_threads + Int32(1)
part_end_idx = min(part_start_idx + n_threads - Int32(1), size(A, 2))

b = zero(eltype(A))
d = zero(eltype(A))

# since the matrix is symmetric, the CSC and CSR layouts coincide, so the CSR routine can be reused.
b,d = _diag_offpart_csr(A.colPtr, A.rowVal, A.nzVal, idx, part_start_idx, part_end_idx)

return DeviceDiagonalCache(k, idx,b, d)
end

function _makecache(iterator::DeviceDiagonalIterator{CUSPARSE.CuSparseDeviceMatrixCSR{Tv,Ti,1}}, idx,k) where {Tv,Ti}
#Ωⁱ := {j ∈ Ωₖ : i ∈ Ωₖ}
#Ωⁱₒ := {j ∉ Ωₖ : i ∈ Ωₖ} off-partition column values
# bₖᵢ := Aᵢᵢ
# dₖᵢ := ∑_{j ∈ Ωⁱₒ} |Aᵢⱼ|
n_threads = blockDim().x
@unpack A = iterator # A is in CSR format
part_start_idx = (k - Int32(1)) * n_threads + Int32(1)
part_end_idx = min(part_start_idx + n_threads - Int32(1), size(A, 2))

b,d = _diag_offpart_csr(A.rowPtr, A.colVal, A.nzVal, idx, part_start_idx, part_end_idx)

return DeviceDiagonalCache(k, idx, b, d)
end


function _diag_offpart_csr(rowPtr, colVal, nzVal, idx::Integer, part_start::Integer, part_end::Integer)
Tv = eltype(nzVal)
b = zero(Tv)
d = zero(Tv)

row_start = rowPtr[idx]
row_end = rowPtr[idx + 1] - 1

for i in row_start:row_end
col = colVal[i]
v = nzVal[i]

if col == idx
b = v
elseif col < part_start || col > part_end
d += abs(v)
end
end

return b, d
end
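As a plain-Julia cross-check of the CSR traversal above, the quantities bᵢ = Aᵢᵢ and dᵢ = Σⱼ |Aᵢⱼ| over off-partition columns j can be computed densely. This helper is hypothetical, for illustration only:

```julia
# Dense reference for the L1 smoother quantities of row i:
# b is the diagonal entry A[i, i]; d sums |A[i, j]| over columns j
# outside the partition [part_start, part_end]. The diagonal itself is
# always inside its own partition, so it never contributes to d.
function diag_offpart_dense(A::AbstractMatrix, i::Integer, part_start::Integer, part_end::Integer)
    b = A[i, i]
    d = zero(eltype(A))
    for j in axes(A, 2)
        if j < part_start || j > part_end
            d += abs(A[i, j])
        end
    end
    return b, d
end
```

For the tridiagonal test matrix in `test/gpu/test_preconditioners.jl` with two-row partitions, this gives b = 2 for every row and d = 1 for every row that has a neighbor outside its partition.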


function _l1prec_kernel!(y, A,issym)
# this kernel will loop over the corresponding diagonals in strided fashion.
# e.g. if n_threads = 4, n_blocks = 2, A is (100 x 100), and current global thread id = 5,
# then the kernel will loop over the diagonals with stride:
# k (partition index) = k + n_blocks (i.e. 2, 4, 6, 8, 10, 12, 14)
# idx (diagonal index) = idx + n_blocks * n_threads (i.e. 5, 13, 21, 29, 37, 45, 53)
symT = issym ? SymmetricMatrix : NonSymmetricMatrix
for diagonal in DiagonalIterator(symT,A)
@unpack k, idx, b, d = diagonal
@cushow k,d #TODO: remove this line
y[idx] = y[idx]/ (b + d)
end
return nothing
end
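The stride pattern the kernel comment describes can be reproduced on the CPU; this sketch (illustration only, not part of the PR) lists the partition and diagonal indices one global thread visits:

```julia
# With n_threads = 4 and n_blocks = 2, global thread 5 starts in
# partition k = 2 at diagonal idx = 5, then strides by n_blocks (for k)
# and by n_blocks * n_threads (for idx), as in Base.iterate above.
n_threads, n_blocks, size_A = 4, 2, 100
tid = 5                          # global (1-based) thread id
k   = cld(tid, n_threads)        # initial partition index (blockIdx)
idx = tid                        # initial diagonal index
visited = Tuple{Int,Int}[]
while idx <= size_A
    push!(visited, (k, idx))
    k   += n_blocks
    idx += n_blocks * n_threads
end
# visited begins (2, 5), (4, 13), (6, 21), matching the comment above.
```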
1 change: 1 addition & 0 deletions src/Thunderbolt.jl
@@ -92,6 +92,7 @@ include("disambiguation.jl")
include("modeling/rsafdq2022.jl")
include("discretization/rsafdq-operator.jl")

include("solver/linear/preconditioner.jl")

# TODO put exports into the individual submodules above!
export
24 changes: 24 additions & 0 deletions src/solver/linear/preconditioner.jl
@@ -0,0 +1,24 @@
####################################
## L1 Gauss Seidel Preconditioner ##
####################################

## Interfaces for the L1 Gauss Seidel preconditioner
abstract type AbstractPartitioning end

struct L1Preconditioner{Partitioning,MatrixType}
partitioning::Partitioning
A::MatrixType
end


LinearSolve.ldiv!(::VectorType, ::L1Preconditioner{Partitioning}, ::VectorType) where {VectorType <: AbstractVector, Partitioning} =
error("Not implemented")

abstract type AbstractL1PrecBuilder end
struct CudaL1PrecBuilder <: AbstractL1PrecBuilder end

function build_l1prec(::AbstractL1PrecBuilder, ::AbstractMatrix)
error("Not implemented")
end

(builder::AbstractL1PrecBuilder)(A::AbstractMatrix) = build_l1prec(builder, A)
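A hedged usage sketch of the builder interface above. The solver call assumes the preconditioner can sit in LinearSolve.jl's `Pl` position via the `ldiv!` overload, which this draft only partially implements:

```julia
using SparseArrays, LinearSolve
import Thunderbolt

# Build the L1 preconditioner for a sparse SPD model problem and pass it
# as the left preconditioner of a CG solve. Illustration only; thread and
# block counts are arbitrary choices here.
N = 100
A = spdiagm(0 => 2ones(N), -1 => -ones(N - 1), 1 => -ones(N - 1))
rhs = rand(N)
P = Thunderbolt.CudaL1PrecBuilder()(A; n_threads = 32, n_blocks = 4)
prob = LinearProblem(A, rhs)
sol = solve(prob, KrylovJL_CG(); Pl = P)
```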
1 change: 1 addition & 0 deletions test/gpu/runtests.jl
@@ -5,3 +5,4 @@ using StaticArrays

include("test_operators.jl")
include("test_coefficients.jl")
include("test_preconditioners.jl")
37 changes: 37 additions & 0 deletions test/gpu/test_preconditioners.jl
Original file line number Diff line number Diff line change
@@ -0,0 +1,37 @@
using SparseArrays
using CUDA
using Thunderbolt
using LinearSolve
using SparseMatricesCSR

N = 8
A = spdiagm(0 => 2 * ones(N), -1 => -ones(N-1), 1 => -ones(N-1))

cudal1prec = Thunderbolt.CudaL1PrecBuilder()
P = cudal1prec(A; n_threads=2, n_blocks=1)
x = eltype(A).(collect(0:N-1))
y = x
LinearSolve.ldiv!(y, P, x)


B = SparseMatrixCSR(A)
x = eltype(A).(collect(0:N-1))
y = x
P = cudal1prec(B; n_threads=2, n_blocks=1)
LinearSolve.ldiv!(y, P, x)
## TODO: Add tests for the above code snippet



abstract type AbstractMatrixSymmetry end

struct SymmetricMatrix <: AbstractMatrixSymmetry end
struct NonSymmetricMatrix <: AbstractMatrixSymmetry end

struct DeviceDiagonalIterator{MatrixType, MatrixSymmetry <: AbstractMatrixSymmetry}
A::MatrixType
end

matrix_symmetry_type(A::AbstractSparseMatrix) = isapprox(A, A',rtol=1e-12) ? SymmetricMatrix : NonSymmetricMatrix

DiagonalIterator(A::MatrixType) where {MatrixType} = DeviceDiagonalIterator{MatrixType,matrix_symmetry_type(A)}(A)
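For the TODO above, a sketch of the assertions the tridiagonal case admits, assuming the kernel ran and wrote back into `y`: with `n_threads = 2` the partitions are row pairs, every row has b = 2, the boundary rows (1 and 8) have d = 0, and all other rows have d = 1, so the preconditioner divides entrywise by `[2, 3, 3, 3, 3, 3, 3, 2]`.

```julia
using Test

# Hypothetical test body: expected result of ldiv! for x = 0:7 with the
# 8x8 tridiagonal A partitioned into row pairs (per-row b + d below).
expected = collect(0.0:7.0) ./ [2, 3, 3, 3, 3, 3, 3, 2]
# @test Vector(y) ≈ expected   # enable once the kernel output is final
```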