Description
We should have logic in the code that detects when GPU performance is likely to be poor and aborts the run if so. The simplest check is whether all of the GPU memory is allocated. The user would get a warning message saying they should restart the run with more GPUs (or run a smaller problem). For users who really do want to run this way, we'll implement an "expert mode" runtime option that lets them override this constraint.
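A minimal sketch of what that check could look like, kept independent of any GPU API so it can be tested on its own. The function names, the `expert_mode` flag, and the `min_free_fraction` threshold are all hypothetical, not existing runtime parameters. In a CUDA build the two byte counts could come from `cudaMemGetInfo(&free_bytes, &total_bytes)`; how that maps onto the AMReX device layer is left open here.

```cpp
#include <cstddef>

// Hypothetical helper: returns true when the device is effectively out of
// free memory. free_bytes and total_bytes are supplied by the caller, e.g.
// from cudaMemGetInfo() in a CUDA build. min_free_fraction is an assumed
// tunable threshold, not an existing option.
bool gpu_memory_exhausted (std::size_t free_bytes, std::size_t total_bytes,
                           double min_free_fraction = 0.02)
{
    if (total_bytes == 0) {
        return false; // no usable device information, so do not trigger
    }
    const double free_fraction =
        static_cast<double>(free_bytes) / static_cast<double>(total_bytes);
    return free_fraction < min_free_fraction;
}

// Hypothetical decision function: expert_mode stands in for the proposed
// "expert mode" runtime option and disables the check entirely.
bool should_abort_for_gpu_memory (std::size_t free_bytes, std::size_t total_bytes,
                                  bool expert_mode,
                                  double min_free_fraction = 0.02)
{
    return !expert_mode &&
           gpu_memory_exhausted(free_bytes, total_bytes, min_free_fraction);
}
```

Using a small free-memory fraction rather than testing for exactly zero free bytes is a design assumption: the allocator rarely reports exactly zero, so a threshold is likely to be a more practical trigger.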
The only subtlety here is how to actually stop the run. Should it be a hard crash (e.g. amrex::Error()) or a graceful stop (i.e. allow the timestep to complete and write a checkpoint)? Many of us use job scripts that chain jobs at HPC centers. How can we implement this so that those job scripts can easily detect that a run was stopped for this reason and stop chaining?
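One possible answer to the chaining question, sketched under assumptions: on a graceful stop the code writes a sentinel file into the run directory and exits with a reserved nonzero code, and the job script checks for either before submitting the next job. The file name, the exit code value, and the helper below are hypothetical, not existing behavior.

```cpp
#include <fstream>
#include <string>

// Hypothetical exit code reserved for "stopped because GPU memory is full".
constexpr int gpu_memory_stop_exit_code = 42;

// Hypothetical helper: write a small sentinel file that a chaining job
// script can test for. Returns true only if the file was written.
bool write_stop_sentinel (const std::string& path, const std::string& reason)
{
    std::ofstream out(path);
    if (!out) {
        return false; // could not open the file for writing
    }
    out << reason << '\n';
    out.flush();
    return out.good();
}
```

After the checkpoint is written, the driver would call something like `write_stop_sentinel("stopped_gpu_memory", "GPU memory fully allocated")` and then exit with `gpu_memory_stop_exit_code`. A sentinel file may be the more robust signal of the two, since an exit code has to survive the MPI launcher and the batch system before the job script sees it.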