Add tests for GPU distributed DatasetRestoring #694
Remove unnecessary blank line in ocean_simulation.jl
Actually, the error I was hitting, a CUDA illegal memory access, is just connected to the fact that we had this error:
ERROR: DomainError with -1.0:
sqrt was called with a negative real argument but will only return a complex result if called with a complex argument. Try sqrt(Complex(x)).
Stacktrace:
[1] throw_complex_domainerror(f::Symbol, x::Float64)
@ Base.Math ./math.jl:33
[2] sqrt
@ ./math.jl:686 [inlined]
[3] sqrt(x::Int64)
@ Base.Math ./math.jl:1578
[4] top-level scope
@ REPL[1]:1
whoa. I wonder how many times I have seen this. wow.
I was mind-blown as well...
I tried writing up an MWE, but on the GPU I only get NaNs... Maybe it's how these NaNs propagate that generates the CUDA illegal memory access?
julia> using KernelAbstractions, CUDA
julia> @kernel function negative_sqrt!(a)
           i = @index(Global, Linear)
           @inbounds a[i] = sqrt(-1)
       end
julia> a = zeros(5);
julia> loop! = negative_sqrt!(KernelAbstractions.CPU(), 5, 5)
KernelAbstractions.Kernel{CPU, KernelAbstractions.NDIteration.StaticSize{(5,)}, KernelAbstractions.NDIteration.StaticSize{(5,)}, typeof(cpu_negative_sqrt!)}(CPU(false), cpu_negative_sqrt!)
julia> loop!(a)
ERROR: DomainError with -1.0:
sqrt was called with a negative real argument but will only return a complex result if called with a complex argument. Try sqrt(Complex(x)).
Stacktrace:
[1] throw_complex_domainerror(f::Symbol, x::Float64)
@ Base.Math ./math.jl:33
[2] sqrt
@ ./math.jl:627 [inlined]
[3] sqrt(x::Int64)
@ Base.Math ./math.jl:1546
[4] macro expansion
@ ~/.julia/packages/KernelAbstractions/X5fk1/src/macros.jl:314 [inlined]
[5] cpu_negative_sqrt!(__ctx__::KernelAbstractions.CompilerMetadata{…}, a::Vector{…})
@ Main ./none:0
[6] __thread_run(tid::Int64, len::Int64, rem::Int64, obj::KernelAbstractions.Kernel{…}, ndrange::Nothing, iterspace::KernelAbstractions.NDIteration.NDRange{…}, args::Tuple{…}, dynamic::KernelAbstractions.NDIteration.NoDynamicCheck)
@ KernelAbstractions ~/.julia/packages/KernelAbstractions/X5fk1/src/cpu.jl:145
[7] __run(obj::KernelAbstractions.Kernel{…}, ndrange::Nothing, iterspace::KernelAbstractions.NDIteration.NDRange{…}, args::Tuple{…}, dynamic::KernelAbstractions.NDIteration.NoDynamicCheck, static_threads::Bool)
@ KernelAbstractions ~/.julia/packages/KernelAbstractions/X5fk1/src/cpu.jl:112
[8] #_#20
@ ~/.julia/packages/KernelAbstractions/X5fk1/src/cpu.jl:46 [inlined]
[9] (::KernelAbstractions.Kernel{…})(args::Vector{…})
@ KernelAbstractions ~/.julia/packages/KernelAbstractions/X5fk1/src/cpu.jl:39
[10] top-level scope
@ REPL[8]:1
Some type information was truncated. Use `show(err)` to see complete types.
julia> a = CuArray(a);
julia> loop! = negative_sqrt!(CUDA.CUDABackend(), 5, 5)
KernelAbstractions.Kernel{CUDABackend, KernelAbstractions.NDIteration.StaticSize{(5,)}, KernelAbstractions.NDIteration.StaticSize{(5,)}, typeof(gpu_negative_sqrt!)}(CUDABackend(false, false), gpu_negative_sqrt!)
julia> loop!(a)
julia> a
5-element CuArray{Float64, 1, CUDA.DeviceMemory}:
NaN
NaN
NaN
NaN
NaN
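The contrast in the MWE above (a `DomainError` on the CPU, silent NaNs on the GPU) can be sketched on the CPU alone. This is only an illustration: `nan_sqrt` is a hypothetical helper, not part of Oceananigans, KernelAbstractions, or CUDA.jl, and the input array is made up. It mimics the NaN result the GPU kernel returned so that offending entries can be located before a kernel launch.

```julia
# Hypothetical helper (not from any package): return NaN instead of throwing,
# mirroring what the GPU kernel in the MWE produced for sqrt(-1).
nan_sqrt(x::Real) = x < 0 ? oftype(float(x), NaN) : sqrt(x)

# On the CPU, Base.sqrt throws a DomainError for a negative real argument.
threw = try
    sqrt(-1.0)
    false
catch err
    err isa DomainError
end

# Made-up input with one negative entry.
a = [4.0, -1.0, 9.0]
b = nan_sqrt.(a)

# Indices of the entries that would have raised the DomainError on the CPU.
bad = findall(isnan, b)
```

Scanning for NaNs like this on a small CPU copy of the data is one way to check whether a negative `sqrt` argument is feeding the computation that later fails on the GPU.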
I am having some issues with `DatasetRestoring` on multiple GPUs. The `update_model_field_time_series!` function crashes (deterministically) with a CUDA illegal memory access connected to Oceananigans' `cpu_interpolating_time_indices`.
I am trying to debug this because it is halting the OMIP progress. In the meantime, I am adding a multi-GPU dataset restoring test here to see if I can reproduce the error.
I think the tests need a bit of an overhaul.