Implement GPU performance checker #798

@maximumcats

Description

We should have logic in the code that detects when we are likely to get poor GPU performance, and aborts the run if so. The simplest check is whether all of the GPU memory is allocated. The user would get a message telling them to restart the run with more GPUs (or run a smaller problem). If the user really wants to run this way, we'll implement an "expert mode" runtime option that lets them override this constraint.
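A minimal sketch of the decision logic described above, kept independent of any GPU runtime so it can be tested on its own. All names here (`GpuMemVerdict`, `check_gpu_memory`, `max_used_fraction`, `expert_mode`) are hypothetical and not existing code or options. The free and total byte counts are assumed to come from the GPU runtime (for CUDA, `cudaMemGetInfo` reports both), and the usage threshold is an assumption, since the issue only says "all of the GPU memory is allocated".

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical outcome of the check; the names are illustrative only.
enum class GpuMemVerdict { Ok, WarnOnly, Abort };

// free_bytes and total_bytes are assumed to be queried from the GPU runtime.
// max_used_fraction and expert_mode stand in for runtime parameters that
// do not exist yet.
inline GpuMemVerdict check_gpu_memory(std::size_t free_bytes,
                                      std::size_t total_bytes,
                                      double max_used_fraction,
                                      bool expert_mode)
{
    assert(total_bytes > 0 && free_bytes <= total_bytes);

    const double used_fraction =
        static_cast<double>(total_bytes - free_bytes) /
        static_cast<double>(total_bytes);

    // Below the threshold: nothing to report.
    if (used_fraction < max_used_fraction) {
        return GpuMemVerdict::Ok;
    }

    // At or above the threshold: expert mode downgrades the abort to a warning.
    return expert_mode ? GpuMemVerdict::WarnOnly : GpuMemVerdict::Abort;
}
```

With a threshold of 0.95, for example, a device reporting no free memory would give `Abort` by default and `WarnOnly` when the expert-mode override is set.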

The only subtlety here is how to actually stop the run. Should it be a hard crash (i.e. amrex::Error()) or a graceful stop (i.e. allow the timestep to complete, and write a checkpoint)? Many of us use job scripts that chain jobs at HPC centers. How can we implement this so that those job scripts can easily detect that a run was stopped for this reason, and stop chaining?
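One possible answer to the chaining question, sketched under assumptions: on a graceful stop, the code writes a small marker file and exits with a reserved status, and the job script checks for either one before resubmitting. The function name `write_stop_marker`, the exit code value, and the idea of a marker file are all hypothetical and not part of the existing code.

```cpp
#include <fstream>
#include <string>

// Hypothetical exit status reserved for "stopped by the GPU performance check".
constexpr int gpu_perf_exit_code = 42;

// Write a marker file that a job script can test for before resubmitting.
// Returns true only if the file was opened and the reason was written.
inline bool write_stop_marker(const std::string& path, const std::string& reason)
{
    std::ofstream marker(path);
    if (!marker) {
        return false;
    }
    marker << reason << '\n';
    return static_cast<bool>(marker.flush());
}
```

In a parallel run, presumably only the I/O rank would write the marker, after the checkpoint has been written. A chaining script could then test for the file (for example with `test -e`) and skip the next submission. Exit codes may not always survive the job launcher, so the marker file is the more conservative of the two signals.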
