[21pt] Proc count control and max errors management #125

Open
@RobHanna-NOAA

Here is a common problem:
Load a batch of models for a HUC (let's say 300) and process them on a 48-core machine. Something goes wrong: maybe all of the models are bad, or there is a code issue, or whatever. Right now it is possible to spawn 47 worker procs (our code takes the max number of cores minus 1, so 48 - 1), and if all of them fail, it can crash my machine. Even if the run gets a bit farther before it crashes, we don't have a way to abort it (across all procs), so it will continue to process as many models as it can and in theory could fail on all 300 models.

Possible fixes (we might need more than one):

  1. Do we really want to leave the worker proc count automatically set to (max CPUs - 1)? At a minimum, do we want to subtract 2 or 3 so we don't overload the computer to the point of crashing?
  2. We really need to figure out a way to abort the run.
  3. At a minimum, should we track the number of crashes and auto-terminate the run? For example, at a preset number such as 20, or maybe at a ratio such as 10% of the models?
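To make items 2 and 3 concrete, here is a rough sketch of counting failures and aborting the run once a limit is hit. It is not our current code. The names `error_limit`, `run_models`, and `process_model` are hypothetical, and combining the two caps by taking the smaller one (20 failures or 10% of the models) is just one possible reading of the idea above.

```python
from concurrent.futures import ProcessPoolExecutor, as_completed


def error_limit(total_models, max_abs=20, max_ratio=0.10):
    """Failures tolerated before aborting: the smaller of the two caps, but at least 1.

    Assumption: both the fixed count and the ratio apply, and the stricter one wins.
    """
    return max(1, min(max_abs, int(total_models * max_ratio)))


def run_models(models, process_model, max_workers, executor_cls=ProcessPoolExecutor):
    """Run process_model on every model, aborting once too many have failed.

    process_model is a placeholder for whatever processes a single model;
    it is expected to raise an exception on failure.
    Returns (done, failed, aborted).
    """
    limit = error_limit(len(models))
    done, failed, aborted = [], [], False

    with executor_cls(max_workers=max_workers) as pool:
        futures = {pool.submit(process_model, m): m for m in models}
        for fut in as_completed(futures):
            model = futures[fut]
            try:
                fut.result()
                done.append(model)
            except Exception as exc:
                failed.append((model, repr(exc)))
                if len(failed) >= limit:
                    # Cancel everything that has not started yet. Tasks that
                    # are already running are left to finish on their own.
                    for pending in futures:
                        pending.cancel()
                    aborted = True
                    break

    return done, failed, aborted
```

With 300 models this stops after 20 failures instead of grinding through all 300. Note that it only stops new work from starting; killing procs that are already hung would need something extra, such as a per-call timeout.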

We certainly need better CPU control to stop a run from overloading and crashing the entire OS (too many frozen procs and/or procs in use).
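As a starting point for the CPU control side, here is a small sketch of capping the worker proc count. The function name `worker_count` and the default reserve of 2 cores are assumptions for illustration, not what the code does today.

```python
import os


def worker_count(requested=None, reserve=2, cpu_total=None):
    """Pick a worker proc count that always leaves `reserve` cores for the OS.

    requested: optional user-supplied count; it is clamped to the ceiling.
    cpu_total: override for testing; defaults to os.cpu_count().
    """
    cpu_total = cpu_total or os.cpu_count() or 1
    ceiling = max(1, cpu_total - reserve)
    if requested is None:
        return ceiling
    return max(1, min(requested, ceiling))
```

On a 48-core machine, `worker_count()` gives 46 and `worker_count(reserve=3)` gives 45. A user-supplied value larger than the ceiling is clamped rather than rejected.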

I am already actively working on one fix, which is to manage HECRAS calls so they quit hanging. See Issue #60 (as of Aug 2, 2023).
