[21pt] Proc count control and max errors management #125

Here is a common problem:
Load a bunch of models for a HUC (let's say 300) and process them on a 48-core machine. Something goes wrong: maybe all the models are bad, or there is a code issue, or whatever. Right now it is possible to spawn 47 worker procs (our code takes the max CPU count minus 1, so 48 - 1), and if all of them fail, it can crash my machine. Even if it gets a bit farther before it crashes, we don't have a way to abort the run (across all procs), so it will continue to process as many models as it can and in theory could fail on all 300 models.

Possible fixes (and we might need more than one)
1) Do we really want to leave the worker proc count automatically set to (max CPUs - 1)? At a minimum, do we want to subtract 2 or 3 so we don't overload the computer to the point of crashing?
2) We really need to figure out a way to abort the run.
3) At a minimum, should we track the number of crashes and auto-terminate the run? E.g. at a preset number of 20, or maybe a ratio of 10% of the models?
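
As a starting point for fixes 2 and 3, here is a rough sketch of a failure limit that aborts the run across all procs. This is not existing ras2fim code: `failure_limit`, `run_models`, `TooManyFailures`, and the `process_model` callable are all hypothetical names, and combining the two thresholds as "whichever is hit first" is an assumption, since the issue only suggests a count of 20 or a ratio of 10%.

```python
from concurrent.futures import ProcessPoolExecutor, as_completed


class TooManyFailures(RuntimeError):
    """Raised when the run is aborted because the failure limit was reached."""

    def __init__(self, failed, limit):
        super().__init__(f"aborting run: {len(failed)} models failed (limit {limit})")
        self.failed = failed


def failure_limit(total, max_count=20, max_ratio=0.10):
    """Number of failures that aborts a run of `total` models.

    Assumption: use whichever threshold is smaller, but never less than 1.
    """
    return max(1, min(max_count, int(total * max_ratio)))


def run_models(models, process_model, max_workers, executor_cls=ProcessPoolExecutor):
    """Process every model, aborting once too many of them have failed."""
    limit = failure_limit(len(models))
    done, failed = [], []
    with executor_cls(max_workers=max_workers) as pool:
        futures = {pool.submit(process_model, m): m for m in models}
        for fut in as_completed(futures):
            model = futures[fut]
            try:
                fut.result()
                done.append(model)
            except Exception:
                failed.append(model)
                if len(failed) >= limit:
                    # Cancel everything still queued. Tasks that are already
                    # running cannot be cancelled and will finish first.
                    for pending in futures:
                        pending.cancel()
                    raise TooManyFailures(failed, limit)
    return done, failed
```

With 300 models this gives a limit of 20 (the smaller of 20 and 10% of 300). One caveat: cancelling only stops work that has not started yet, so a proc that is truly hung (e.g. a stuck HECRAS call) would still need its own timeout or kill handling.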

We certainly need better CPU control to stop it from overloading and crashing the entire OS (too many frozen procs and/or procs in use).
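
For the CPU control part, a minimal sketch of how the worker count could be capped. The `worker_count` helper and its `reserve`, `requested`, and `cpu_total` parameters are hypothetical, and the default reserve of 2 is just one of the values floated above, not a tested recommendation.

```python
import os


def worker_count(reserve=2, requested=None, cpu_total=None):
    """Pick a worker proc count that leaves `reserve` CPUs free for the OS.

    `cpu_total` is only there so the logic can be checked without a
    48-core machine; by default it comes from os.cpu_count().
    """
    # os.cpu_count() can return None when the count cannot be determined.
    total = cpu_total if cpu_total is not None else (os.cpu_count() or 1)
    ceiling = max(1, total - reserve)
    if requested is None:
        return ceiling
    # Never exceed the ceiling, even if the user asks for more.
    return max(1, min(requested, ceiling))
```

On the 48-core machine from the example this would return 46 by default, or 45 with `reserve=3`, and it would clamp a user-supplied count to that ceiling instead of trusting it blindly.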

I am already actively working on one fix, which is to manage HECRAS calls so they quit hanging.  See [Issue # 60](https://github.com/NOAA-OWP/ras2fim/issues/60) (as of Aug 2, 2023)
