Skip to content

(0.96.28) Small improvements to NaNChecker #4470

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 10 commits into
base: main
Choose a base branch
from

Conversation

glwagner
Copy link
Member

@glwagner glwagner commented May 4, 2025

Inspired by a discussion with @navidcy and @taimoorsohail, this PR implements two small improvements to the NaNChecker:

  1. Preallocate an object to store the result of the reduction which computes whether a field has NaNs or not, which saves some memory allocation (can be important if trying to diagnose NaNs every time-step
  2. Return the result of NaN checking from the callback function. Useful if we don't want to add NaNChecker directly as a callback but rather call it within another callback.

I may add a few more improvements here, because more ideas are coming to mind...

Copy link
Member

@navidcy navidcy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds good! thanks!

@navidcy
Copy link
Member

navidcy commented May 5, 2025

bunch of tests fail, e.g.,

https://buildkite.com/clima/oceananigans/builds/23183#01969c41-533f-4867-bc39-028df30cd953/6-37

Is it because the fallback

NaNChecker(fields) = NaNChecker(fields, false) # default

was removed? Or perhaps because the fallback has an arg and now it became kwarg?

@glwagner
Copy link
Member Author

glwagner commented May 5, 2025

I think because somebody is using that constructor. I think its better to have just one constructor so I will try to fix the errors

@navidcy navidcy changed the title Small improvements to NaNChecker (0.96.25) Small improvements to NaNChecker May 8, 2025
@navidcy
Copy link
Member

navidcy commented May 8, 2025

@taimoorsohail, should work now and test will (hopefully) pass

@taimoorsohail
Copy link
Collaborator

Tests are still failing - having a look now

@taimoorsohail
Copy link
Collaborator

I haven't been able to figure out why these tests are failing - other than that they don't seem to be related to the NaN checker changes as far as I can tell.

@navidcy
Copy link
Member

navidcy commented May 9, 2025

This

https://buildkite.com/clima/oceananigans/builds/23277#0196b2e0-8d99-4d3d-a91d-1a4e8c0275e9/18-1245

might seem unrelated but following the error stack trace it does originate from the nanchecker

https://buildkite.com/clima/oceananigans/builds/23277#0196b2e0-8d99-4d3d-a91d-1a4e8c0275e9/18-1273

I'll have a look now.

@navidcy
Copy link
Member

navidcy commented May 9, 2025

@glwagner, seems like hasnan breaks for MultiRegion (probably also Distributed?) fields? Here's a mwe:

using Oceananigans

grid = RectilinearGrid(CPU(), size = (64, 3, 1), x = (0, 100e3), y = (0, 100e3), z = (-400, 0))

regions = 2
mrg = MultiRegionGrid(grid, partition = XPartition(regions))

model = HydrostaticFreeSurfaceModel(grid = mrg)

simulation = Simulation(model; Δt=0.1, stop_iteration = 10)
run!(simulation)
julia> run!(simulation)
[2025/05/09 12:12:48.874] INFO  Initializing simulation...
ERROR: MethodError: no method matching interior(::MultiRegionObject{…}, ::Tuple{…}, ::MultiRegionGrid{…}, ::MultiRegionObject{…})

Closest candidates are:
  interior(::Field, ::Any...)
   @ Oceananigans ~/Library/CloudStorage/OneDrive-TheUniversityofMelbourne/Documents/Research/Oceananigans.jl-v3/src/Fields/field.jl:407
  interior(::FieldTimeSeries, ::Any...)
   @ Oceananigans ~/Library/CloudStorage/OneDrive-TheUniversityofMelbourne/Documents/Research/Oceananigans.jl-v3/src/OutputReaders/field_time_series.jl:724
  interior(::OffsetArray, ::Any, ::Any, ::Any)
   @ Oceananigans ~/Library/CloudStorage/OneDrive-TheUniversityofMelbourne/Documents/Research/Oceananigans.jl-v3/src/Fields/field.jl:406
  ...

Stacktrace:
  [1] interior(f::Field{…})
    @ Oceananigans.Fields ~/Library/CloudStorage/OneDrive-TheUniversityofMelbourne/Documents/Research/Oceananigans.jl-v3/src/Fields/field.jl:405
  [2] any!(f::Function, r::Field{…}, a::Field{…}; condition::Nothing, mask::Bool, kwargs::@Kwargs{})
    @ Oceananigans.Fields ~/Library/CloudStorage/OneDrive-TheUniversityofMelbourne/Documents/Research/Oceananigans.jl-v3/src/Fields/field.jl:682
  [3] any!(f::Function, r::Field{…}, a::Field{…})
    @ Oceananigans.Fields ~/Library/CloudStorage/OneDrive-TheUniversityofMelbourne/Documents/Research/Oceananigans.jl-v3/src/Fields/field.jl:673
  [4] hasnan(field::Field{…}, checker::Oceananigans.Models.NaNChecker{…})
    @ Oceananigans.Models ~/Library/CloudStorage/OneDrive-TheUniversityofMelbourne/Documents/Research/Oceananigans.jl-v3/src/Models/nan_checker.jl:40
  [5] (::Oceananigans.Models.NaNChecker{…})(simulation::Simulation{…})
    @ Oceananigans.Models ~/Library/CloudStorage/OneDrive-TheUniversityofMelbourne/Documents/Research/Oceananigans.jl-v3/src/Models/nan_checker.jl:49
  [6] (::Callback{…})(sim::Simulation{…})
    @ Oceananigans.Simulations ~/Library/CloudStorage/OneDrive-TheUniversityofMelbourne/Documents/Research/Oceananigans.jl-v3/src/Simulations/callback.jl:15
  [7] initialize!(sim::Simulation{…})
    @ Oceananigans.Simulations ~/Library/CloudStorage/OneDrive-TheUniversityofMelbourne/Documents/Research/Oceananigans.jl-v3/src/Simulations/run.jl:235
  [8] time_step!(sim::Simulation{…})
    @ Oceananigans.Simulations ~/Library/CloudStorage/OneDrive-TheUniversityofMelbourne/Documents/Research/Oceananigans.jl-v3/src/Simulations/run.jl:136
  [9] run!(sim::Simulation{…}; pickup::Bool)
    @ Oceananigans.Simulations ~/Library/CloudStorage/OneDrive-TheUniversityofMelbourne/Documents/Research/Oceananigans.jl-v3/src/Simulations/run.jl:105
 [10] run!(sim::Simulation{…})
    @ Oceananigans.Simulations ~/Library/CloudStorage/OneDrive-TheUniversityofMelbourne/Documents/Research/Oceananigans.jl-v3/src/Simulations/run.jl:92
 [11] top-level scope
    @ REPL[75]:1
Some type information was truncated. Use `show(err)` to see complete types.

@navidcy
Copy link
Member

navidcy commented May 9, 2025

Wondering if we need @apply_regionally at

any!(isnan, checker.hasnan, field)

@navidcy
Copy link
Member

navidcy commented May 11, 2025

Actually on second thought I'm wondering how would this work on distributed architectures. Do we need to be calling any! eg on each MPI rank and then combine?

@glwagner
Copy link
Member Author

Wondering if we need @apply_regionally at

any!(isnan, checker.hasnan, field)

shouldn't that work under the hood? same with distributed. otherwise users cannot reduce either without @apply_regionally

@navidcy
Copy link
Member

navidcy commented May 20, 2025

I think it should work after #4497

@taimoorsohail
Copy link
Collaborator

Is this ready to merge once tests pass? now that #4520 is merged?

@navidcy
Copy link
Member

navidcy commented May 22, 2025

No. First #4497 needs to get in

@taimoorsohail
Copy link
Collaborator

Thanks!

@navidcy navidcy changed the title (0.96.25) Small improvements to NaNChecker (0.96.28) Small improvements to NaNChecker May 22, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants