@simone-silvestri
Collaborator

I am having some issues with DatasetRestoring on multiple GPUs. The update_model_field_time_series! function crashes deterministically with a CUDA illegal memory access connected to Oceananigans' cpu_interpolating_time_indices.

I am trying to debug this because it is halting OMIP progress. In the meantime, I am adding a multi-GPU DatasetRestoring test here to see if I can reproduce the error.

I think the tests need a bit of an overhaul.

@simone-silvestri
Collaborator Author

Actually, the error I was finding, which was a CUDA Illegal memory access, is just connected to the fact that we had a

ERROR: DomainError with -1.0:
sqrt was called with a negative real argument but will only return a complex result if called with a complex argument. Try sqrt(Complex(x)).
Stacktrace:
 [1] throw_complex_domainerror(f::Symbol, x::Float64)
   @ Base.Math ./math.jl:33
 [2] sqrt
   @ ./math.jl:686 [inlined]
 [3] sqrt(x::Int64)
   @ Base.Math ./math.jl:1578
 [4] top-level scope
   @ REPL[1]:1

error.
Apparently, this error is not shown on the GPU, but it corrupts the GPU memory which then eventually spits out the CUDA Illegal memory access.

@glwagner
Member

glwagner commented Dec 4, 2025

> Apparently, this error is not shown on the GPU, but it corrupts the GPU memory which then eventually spits out the CUDA Illegal memory access.

whoa. I wonder how many times I have seen this. wow.

@navidcy
Member

navidcy commented Dec 5, 2025

> > Apparently, this error is not shown on the GPU, but it corrupts the GPU memory which then eventually spits out the CUDA Illegal memory access.
>
> whoa. I wonder how many times I have seen this. wow.

I was mind-blown as well...

@simone-silvestri
Collaborator Author

I tried writing up an MWE, but on the GPU I just get NaNs instead of an error... Maybe it's how these NaNs are propagated that generates the CUDA illegal memory access?

julia> using KernelAbstractions, CUDA

julia> @kernel function negative_sqrt!(a)
          i = @index(Global, Linear)
          @inbounds a[i] = sqrt(-1)
       end

julia> a = zeros(5);

julia> loop! = negative_sqrt!(KernelAbstractions.CPU(), 5, 5)
KernelAbstractions.Kernel{CPU, KernelAbstractions.NDIteration.StaticSize{(5,)}, KernelAbstractions.NDIteration.StaticSize{(5,)}, typeof(cpu_negative_sqrt!)}(CPU(false), cpu_negative_sqrt!)

julia> loop!(a)
ERROR: DomainError with -1.0:
sqrt was called with a negative real argument but will only return a complex result if called with a complex argument. Try sqrt(Complex(x)).
Stacktrace:
  [1] throw_complex_domainerror(f::Symbol, x::Float64)
    @ Base.Math ./math.jl:33
  [2] sqrt
    @ ./math.jl:627 [inlined]
  [3] sqrt(x::Int64)
    @ Base.Math ./math.jl:1546
  [4] macro expansion
    @ ~/.julia/packages/KernelAbstractions/X5fk1/src/macros.jl:314 [inlined]
  [5] cpu_negative_sqrt!(__ctx__::KernelAbstractions.CompilerMetadata{…}, a::Vector{…})
    @ Main ./none:0
  [6] __thread_run(tid::Int64, len::Int64, rem::Int64, obj::KernelAbstractions.Kernel{…}, ndrange::Nothing, iterspace::KernelAbstractions.NDIteration.NDRange{…}, args::Tuple{…}, dynamic::KernelAbstractions.NDIteration.NoDynamicCheck)
    @ KernelAbstractions ~/.julia/packages/KernelAbstractions/X5fk1/src/cpu.jl:145
  [7] __run(obj::KernelAbstractions.Kernel{…}, ndrange::Nothing, iterspace::KernelAbstractions.NDIteration.NDRange{…}, args::Tuple{…}, dynamic::KernelAbstractions.NDIteration.NoDynamicCheck, static_threads::Bool)
    @ KernelAbstractions ~/.julia/packages/KernelAbstractions/X5fk1/src/cpu.jl:112
  [8] #_#20
    @ ~/.julia/packages/KernelAbstractions/X5fk1/src/cpu.jl:46 [inlined]
  [9] (::KernelAbstractions.Kernel{…})(args::Vector{…})
    @ KernelAbstractions ~/.julia/packages/KernelAbstractions/X5fk1/src/cpu.jl:39
 [10] top-level scope
    @ REPL[8]:1
Some type information was truncated. Use `show(err)` to see complete types.

julia> a = CuArray(a);

julia> loop! = negative_sqrt!(CUDA.CUDABackend(), 5, 5)
KernelAbstractions.Kernel{CUDABackend, KernelAbstractions.NDIteration.StaticSize{(5,)}, KernelAbstractions.NDIteration.StaticSize{(5,)}, typeof(gpu_negative_sqrt!)}(CUDABackend(false, false), gpu_negative_sqrt!)

julia> loop!(a)

julia> a
5-element CuArray{Float64, 1, CUDA.DeviceMemory}:
 NaN
 NaN
 NaN
 NaN
 NaN
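
One way to test the NaN hypothesis would be to check a kernel's output on the host right after it runs, before the values feed into anything else. This is only a minimal sketch: `first_nonfinite` and `safe_sqrt` are hypothetical helper names, not part of Oceananigans, KernelAbstractions, or CUDA.jl, and the example runs on plain CPU arrays.

```julia
# Hypothetical helper (not from any package): copy the data to the host and
# return the index of the first NaN or Inf, or `nothing` if every value is finite.
# `Array(a)` accepts a plain `Array`; it should also accept a `CuArray`.
first_nonfinite(a) = findfirst(!isfinite, Array(a))

# Hypothetical guard: clamp negative arguments to zero before taking the root,
# so neither a DomainError (CPU) nor a NaN (GPU, as in the MWE above) is produced.
# Only appropriate when negative inputs can arise from round-off alone.
safe_sqrt(x) = sqrt(max(x, zero(x)))

a = [1.0, NaN, 4.0]
println(first_nonfinite(a))                          # index of the NaN entry
println(first_nonfinite(safe_sqrt.([4.0, -1e-16])))  # all finite, so `nothing`
```

If `first_nonfinite` reports an index right after the suspect kernel, the NaNs originate there rather than further downstream.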
