Performance issues running MOOSE on an HPC: approximately how much RAM should I need for a problem like this? #30019
I am running a problem with a very fine mesh out of necessity. The mesh file is about 260 MB, and the nonlinear system has 1,865,010 degrees of freedom. According to the documentation, 1 CPU per 20,000 DOFs is a good ballpark, but I haven't gotten my simulation to run.

Initially, I requested 1 node, 20 CPUs/node, 4G of memory/CPU. The simulation progressed past "setting up" and output the initial residual (where everything is 0.000), and then failed with an OOM error. I increased the memory to 8G/CPU, but it still failed with an OOM error. I even changed the solve type from LU to an iterative solver, but still got the OOM error.

Then, following the CPU/DOF guideline, I requested 5 nodes, 20 CPUs/node, 4G memory/CPU. This time, it also made it past "setting up" but never got to the initial residual evaluation, and the job was eventually cancelled for exceeding the time limit (I ran it on a short partition, so 4 hours/job). I also increased the number of nodes and kept running into the same problem. I wanted to try increasing the time limit next, but was wondering if there is anything I'm missing that could help performance.

Also, I ran a problem with a slightly less refined mesh (160 MB, so I'm guessing it has ~half the DOFs) with the initial setup, i.e. 1 node, 20 CPUs/node, 4G/CPU.

P.S. The mesh changing is a bit confusing to me since this is a thermal simulation. Yes, I used a custom kernel to write the heat equation in the frequency domain, but it is not a simulation where the mesh should be "displaced" like the preamble says.
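For reference, the resource arithmetic behind the requests described above can be sketched as follows. This is only a back-of-the-envelope check using the numbers from the post; the helper name `ranks_for_dofs` is hypothetical, and the 20,000 DOFs/CPU figure is the ballpark quoted from the documentation, not a hard rule.

```python
import math

def ranks_for_dofs(n_dofs, dofs_per_rank=20_000):
    """Number of CPUs (MPI ranks) suggested by the DOFs-per-CPU ballpark."""
    return math.ceil(n_dofs / dofs_per_rank)

n_dofs = 1_865_010          # nonlinear DOFs reported in the post
cpus_per_node = 20          # cluster layout used in the post
mem_per_cpu_gb = 4          # first memory request in the post

ranks = ranks_for_dofs(n_dofs)                    # CPUs suggested by the guideline
nodes = math.ceil(ranks / cpus_per_node)          # whole nodes needed to host them
total_mem_gb = nodes * cpus_per_node * mem_per_cpu_gb

# Load per CPU in the first attempt (1 node, 20 CPUs)
dofs_per_cpu_first_attempt = n_dofs / cpus_per_node

print(f"suggested CPUs: {ranks}, nodes: {nodes}, total memory: {total_mem_gb} GB")
print(f"first attempt: about {dofs_per_cpu_first_attempt:,.0f} DOFs per CPU")
```

Under these assumptions the guideline suggests 94 CPUs, which is consistent with the 5-node, 100-CPU request, while the single-node attempt put roughly 93,000 DOFs on each CPU.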
Replies: 1 comment 10 replies
Hello,

The switch to an iterative solver is a good idea; LU won't scale to 2M DOFs. It seems the initialization is either very slow or hanging. Can you attach a debugger after it has been running for a while to get the backtrace on a few of the processes? This will tell us what it is doing.
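One possible way to collect such backtraces is sketched below. It assumes `gdb` is installed on the compute node and that you are allowed to attach to your own running processes; the process IDs shown are placeholders, and the helper name `gdb_backtrace_command` is hypothetical. The script only builds and prints the commands, so nothing is attached until you run them yourself on the node.

```python
import shlex

def gdb_backtrace_command(pid):
    """Build a gdb invocation that attaches to `pid`, prints the
    backtrace of every thread, and then exits (batch mode)."""
    return ["gdb", "-p", str(pid), "-batch", "-ex", "thread apply all bt"]

# Placeholder PIDs: replace with the PIDs of a few of your MPI ranks,
# e.g. as listed by `ps` or `top` on the compute node.
example_pids = [12345, 12346]

for pid in example_pids:
    # shlex.join quotes arguments so the line can be pasted into a shell
    print(shlex.join(gdb_backtrace_command(pid)))
```

Comparing the backtraces from a few ranks, taken a few minutes apart, should indicate whether the processes are making progress or are all stuck in the same call.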
Confirmed that the errors I was seeing previously were a mesh issue. I created another mesh with even higher refinement, but structured it in a way that made it easier for SubdomainBoundingBoxGenerator, and the OOM errors disappeared. The ORTE Daemon Failure errors stopped showing up today (I didn't do anything differently), so I assume it must have just been a system issue. Thanks!