Implement GPU performance checker

We should have logic in the code that detects when we are likely to get poor GPU performance, and aborts the run if so. The simplest check is whether all of the GPU memory is allocated. The user would get a message saying they should restart the run with more GPUs (or run a smaller problem). For users who really want to run this way, we'll implement an "expert mode" runtime option that allows them to override this constraint.
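As a starting point for discussion, here is a minimal sketch of what the check could look like. It is not tied to any existing code: `GpuMemStatus`, `check_gpu_memory`, `min_free_fraction`, and `expert_mode` are all hypothetical names. It assumes the free and total device memory have already been queried from the runtime (on CUDA, `cudaMemGetInfo` reports both), and that "all memory allocated" can be generalized to "free fraction at or below some threshold".

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical outcome of the check; these names are illustrative only.
enum class GpuMemStatus { ok, oversubscribed, overridden };

// free_bytes and total_bytes are assumed to come from the device runtime
// (for example cudaMemGetInfo(&free_bytes, &total_bytes) on CUDA).
// min_free_fraction is an assumed tunable: 0.0 means "only trigger when
// the device is completely full", which matches the simplest check.
// expert_mode stands in for the proposed runtime override.
inline GpuMemStatus check_gpu_memory(std::size_t free_bytes,
                                     std::size_t total_bytes,
                                     double min_free_fraction,
                                     bool expert_mode)
{
    assert(min_free_fraction >= 0.0 && min_free_fraction <= 1.0);

    // No usable device information: do not block the run.
    if (total_bytes == 0) {
        return GpuMemStatus::ok;
    }

    const double free_fraction =
        static_cast<double>(free_bytes) / static_cast<double>(total_bytes);

    if (free_fraction > min_free_fraction) {
        return GpuMemStatus::ok;
    }

    // Memory is exhausted (or nearly so): stop unless the user opted out.
    return expert_mode ? GpuMemStatus::overridden
                       : GpuMemStatus::oversubscribed;
}
```

The caller would print the "use more GPUs or a smaller problem" message on `oversubscribed`, and perhaps a one-time warning on `overridden`. Keeping the decision in a pure function like this makes it easy to test without a GPU.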

The only subtlety here is how to actually stop the run. Should it be a hard crash (i.e. amrex::Error()) or a graceful stop (i.e. allow the timestep to complete, and write a checkpoint)? Many of us use job scripts that chain jobs at HPC centers -- how can we implement this so that those job scripts can easily detect that a run was stopped for this reason and stop chaining?
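One possible answer to the chaining question, sketched under assumptions: on a graceful stop the code writes a small sentinel file into the run directory, and optionally exits with a reserved nonzero code. The file name passed in, `gpu_perf_exit_code`, `write_stop_sentinel`, and `should_keep_chaining` are all hypothetical and only illustrate the idea.

```cpp
#include <fstream>
#include <string>

// Hypothetical exit code reserved for this condition, so a wrapper can
// tell "stopped by the GPU performance check" apart from other failures.
constexpr int gpu_perf_exit_code = 42;

// Write a sentinel file recording why the run was stopped.
// Returns true if the file was written successfully.
inline bool write_stop_sentinel(const std::string& path,
                                const std::string& reason)
{
    std::ofstream out(path, std::ios::trunc);
    if (!out) {
        return false;
    }
    out << reason << '\n';
    out.flush();
    return out.good();
}

// The test a chaining wrapper would apply before submitting the next
// job: keep chaining only if the sentinel file cannot be opened.
inline bool should_keep_chaining(const std::string& path)
{
    std::ifstream in(path);
    return !in.good();
}
```

A job script would do the equivalent of `should_keep_chaining` with a plain file-existence test before resubmitting. A sentinel file survives even if the scheduler hides the exit code, which is the main reason to prefer it over an exit code alone.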

Implement GPU performance checker #798
