Stochastic Tools Batch Mode HPC Hang #31558
Hi! I have been using and building various stochastic tools and have found that, in several scenarios, a batch-mode job on INL HPC (PBS or SLURM) hangs after completing the first iteration of a batch-reset run.

For example, on PBS (Sawtooth), when I request 1 node with 48 tasks, the job runs fine with min_procs_per_app/min_procs_per_row set to 8, but it hangs if I change the value to 6.

When the batch system works properly, it moves through all 400 iterations (I see 400 rank 0 processes in the output file). It then reports y/X samples complete until it reaches X/X, finishes, and outputs the results.

I have made sure there is sufficient memory and have tried various combinations of nodes, tasks, and min_procs_per_app/min_procs_per_row, but I have not found a configuration that works on SLURM. Is this a known issue, or is there guidance on how to properly use batch mode with HPC resources? Thanks!
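To make the two 48-task configurations above concrete, here is a minimal Python sketch of the rank arithmetic. It assumes that the available MPI ranks are split into equal groups of min_procs_per_app and that the samples are spread as evenly as possible over those groups; the actual partitioning is handled inside MOOSE and may differ. The function name batch_layout and the choice of 400 samples (taken from the 400 iterations mentioned above) are illustrative only.

```python
import math


def batch_layout(total_ranks, min_procs_per_app, num_samples):
    """Estimate how a batch-mode run might be partitioned.

    Assumption (not taken from the MOOSE source): ranks are split into
    groups of `min_procs_per_app`, and each group works through its share
    of the samples one after another.
    """
    if total_ranks <= 0 or min_procs_per_app <= 0 or num_samples < 0:
        raise ValueError("all arguments must be positive (num_samples may be 0)")

    # Number of sub-applications that can run at the same time.
    concurrent_apps = max(1, total_ranks // min_procs_per_app)
    # Ranks that do not fit into a full group of min_procs_per_app.
    leftover_ranks = max(0, total_ranks - concurrent_apps * min_procs_per_app)
    # Largest number of samples any one group has to run sequentially.
    samples_per_group = math.ceil(num_samples / concurrent_apps)

    return concurrent_apps, leftover_ranks, samples_per_group


if __name__ == "__main__":
    for procs in (8, 6):
        apps, leftover, per_group = batch_layout(48, procs, 400)
        print(f"min_procs_per_app={procs}: {apps} concurrent apps, "
              f"{leftover} leftover ranks, up to {per_group} samples per group")
```

Under these assumptions both settings divide the 48 ranks evenly (6 groups of 8, or 8 groups of 6), so the arithmetic alone does not explain why one configuration hangs; it only shows what changes between the two runs.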
Replies: 1 comment 12 replies
Hello,
What about just 2 nodes? Did you use the HPC-provided MPI compiler? You cannot use conda or locally installed MPI distributions when working on HPC.
@millerzac I can't seem to recreate the issue with a minimal example. If you could post your stochastic tools input, I can try with something closer to what you're running.
If the issue really is the LHS sampler, one workaround that doesn't require modifying the code is to save the samples to a CSV file and then use a CSVSampler in your main input. Basically, you create an input that just outputs the LHS samples (which can be run on a single processor):