Optimizing XC instantiation for GPUs #1262
|
We are starting to see this MPI failure a bit more consistently now. I'm a bit confused about it. |
I am also utterly confused. It certainly started happening after the extension of the MPI tests (#1248). It is hard to pin down, but I'll try and find out if the crash is systematically triggered by the same test or not. |
|
I generalized the dual existence of scalar and vectorized functions for all I chose to implement the vectorized functions at the lowest level of Psp, e.g. in Edit: concerning the failing MPI tests, it looks like the failure always happens in the same test, namely in |
mfherbst left a comment
I think I agree with this mostly. We should see if we can remove some code duplication at the element level by deferring the vector -> map over single elements dispatch fully towards the dispatches on PSP structs.
|
I reorganized a bit, and I was able to reduce code duplication in |
|
@abussy I finally got around to giving this another look. My model was that this
abstract type Element end
function evaluate(x::Element, ps::AbstractVector)
    map(p -> evaluate(x, p), ps)
end
struct ElementPsp{T} <: Element
    inner::T
end
evaluate(x::ElementPsp, ps) = evaluate(x.inner, ps)
evaluate(x::ElementPsp, ps::AbstractVector) = evaluate(x.inner, ps::AbstractVector) # <--- This resolves ambiguity
struct ZeroElement <: Element end
evaluate(::ZeroElement, p::Number) = 0
abstract type NormConservingPsp end
struct OnePsp <: NormConservingPsp end
struct OtherPsp <: NormConservingPsp end
function evaluate(x::NormConservingPsp, ps::AbstractVector)
    map(p -> evaluate(x, p), ps)
end
evaluate(::OnePsp, p::Number) = 1
evaluate(::OtherPsp, ps::AbstractVector) = copy(ps)
evaluate(::OtherPsp, p::Number) = p+1
@assert evaluate(ElementPsp(OnePsp()), [2, 4]) == [1, 1]
@assert evaluate(ElementPsp(OnePsp()), 2) == 1
@assert evaluate(ElementPsp(OtherPsp()), [2, 4]) == [2, 4]
@assert evaluate(ElementPsp(OtherPsp()), 2) == 3
works fine, but only after the highlighted line is added. A silicon calculation runs. Let's see what the tests say and then I'll also try it on a GPU. |
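To make the role of the highlighted line concrete, here is a minimal sketch of what happens without it. The module name `AmbiguityDemo` and the stripped-down `OnePsp` are assumptions made for illustration, not DFTK code; the snippet only reproduces the dispatch pattern discussed in this comment.

```julia
# Hypothetical toy module: same dispatch pattern, without the highlighted line.
module AmbiguityDemo

abstract type Element end
# Generic vectorized fallback for any element.
evaluate(x::Element, ps::AbstractVector) = map(p -> evaluate(x, p), ps)

struct ElementPsp{T} <: Element
    inner::T
end
# Forward everything to the wrapped pseudopotential.
evaluate(x::ElementPsp, ps) = evaluate(x.inner, ps)

struct OnePsp end
evaluate(::OnePsp, p::Number) = 1
evaluate(x::OnePsp, ps::AbstractVector) = map(p -> evaluate(x, p), ps)

end # module

el = AmbiguityDemo.ElementPsp(AmbiguityDemo.OnePsp())

# The scalar call is unambiguous: only the forwarding method applies.
scalar_result = AmbiguityDemo.evaluate(el, 2)

# The vector call matches both (Element, AbstractVector) and (ElementPsp, Any).
# Neither is more specific in both arguments, so Julia throws a MethodError.
err = try
    AmbiguityDemo.evaluate(el, [2, 4])
catch e
    e
end

# Adding the method that is more specific in both arguments resolves it.
AmbiguityDemo.evaluate(x::AmbiguityDemo.ElementPsp, ps::AbstractVector) =
    AmbiguityDemo.evaluate(x.inner, ps)
vector_result = AmbiguityDemo.evaluate(el, [2, 4])
```

The extra method is needed because `ElementPsp` is more specific in the first argument while `AbstractVector` is more specific in the second, so neither of the two original methods wins on its own.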
|
OK, I think there is another problem in the logic here. Now
eval_psp_projector_fourier(psp::NormConservingPsp, i, l, p::AbstractVector) =
    eval_psp_projector_fourier(psp, i, l, norm(p))
# Fallback vectorized implementation for non GPU-optimized code.
function eval_psp_projector_fourier(psp::NormConservingPsp, i, l,
                                    ps::AbstractVector{T}) where {T <: Real}
    arch = architecture(ps)
    to_device(arch, map(p -> eval_psp_projector_fourier(psp, i, l, p), to_cpu(ps)))
end
The idea of the first function was to take the norm for you, for convenience, i.e. you could just call the function on [1, 0, 0] and it would implicitly call the function with norm
Removing the versions which implicitly take the norm (which is what I propose) has the additional advantage that it should remove the method ambiguities, and all the annoying macro magic could also be removed at the PSP level. |
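As a rough sketch of what the proposal could look like from the caller's side: the scalar method only accepts a magnitude, the vectorized method only accepts a vector of magnitudes, and the caller takes the norm explicitly. `ToyPsp` and its Gaussian formula are made up for illustration and are not DFTK code; the device handling of the fallback above is left out to keep the snippet self-contained.

```julia
using LinearAlgebra: norm

abstract type NormConservingPsp end
struct ToyPsp <: NormConservingPsp end  # hypothetical pseudopotential

# Scalar version: p is already a magnitude (made-up Gaussian form).
eval_psp_projector_fourier(::ToyPsp, i, l, p::Real) = exp(-p^2)

# Vectorized fallback: ps is a vector of magnitudes, never a single G-vector.
function eval_psp_projector_fourier(psp::NormConservingPsp, i, l,
                                    ps::AbstractVector{T}) where {T <: Real}
    map(p -> eval_psp_projector_fourier(psp, i, l, p), ps)
end

# The caller takes the norm explicitly instead of relying on an implicit one.
G = [1.0, 0.0, 0.0]
single = eval_psp_projector_fourier(ToyPsp(), 1, 0, norm(G))

# A vector argument now has exactly one meaning: a list of magnitudes.
many = eval_psp_projector_fourier(ToyPsp(), 1, 0, [0.0, 1.0])
```

With only these two signatures there is a single method per argument kind, so no disambiguating definitions should be required.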
|
@abussy I have not checked the performance implications, but the current code runs on the GPU without any issues and looks reasonably clean. Could you check that this fulfils the intention of your PR without regressions? If yes, I'd say we merge this. |
This PR optimizes the instantiation of the XC term on the GPU, with a particularly large impact when Duals are involved (stress and response calculations). It follows the same design patterns as #1163.
Particular care was given to the hankel function in workarounds/forwarddiff_rules.jl, in order to make it GPU compatible. A side effect is the suppression of allocations, which also marginally (by a few percent) speeds up XC instantiation on the CPU.
For illustration, here are XC instantiation timings for the input below (stress of a 3x3x3 rattled aluminium supercell):