Large mesh best practices and memory management #25053
-
Hi all, I have a few questions about how best to deal with large meshes, as I have been running into errors which I think are to do with processes running out of memory.

1. Mesh splitting: it seems that you need to fit the whole mesh into memory in order to perform the splitting operation in the first place? If the mesh doesn't fit, you can't split it to alleviate that problem? Is the idea that you can usually fit the whole mesh into memory when it is just the mesh (i.e., when performing the splitting), but during a simulation you need to allocate some of that memory to the nonlinear and aux systems, for instance, so you can no longer afford to store the whole mesh? A further question is whether the mesh splitting operation somehow assumes that you will have the same amount of memory when you run the simulation as you did during splitting?

2. What is the difference between pre-splitting the mesh and setting a distributed mesh? Up to now, I had thought that setting this […]

3. What about storing the nonlinear and auxiliary system variables?

In case it's useful, I usually get exit code 7 (bus error) when things don't work, and occasionally exit code 9 (killed) from mpirun. 5 million elements doesn't seem completely unreasonable, so any insight into where I might be going wrong is greatly appreciated. Thanks!
-
This is a very reasonable size. How many degrees of freedom are in the nonlinear and aux systems (you can see this in the header printout)?
That's correct. You gain two things here by each processor only needing to load its small part of the mesh (keep in mind that when loading from splits, you're forcing a distributed mesh): […]
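As a minimal sketch of what running from a pre-split mesh looks like, assuming the standard `--use-split`/`--split-file` options (the binary name, input file, and split-file name are placeholders):

```bash
# Each of the 600 ranks reads only its own piece of the pre-split mesh,
# so no rank ever holds the full mesh (a distributed mesh is forced).
mpiexec -n 600 ./myapp-opt -i input.i --use-split --split-file mesh
```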
You should split the mesh only using a few processes. That is: […] Each process will still load the full mesh (so 4 copies of the mesh), and each will be responsible for saving 150 partitions in this case. As your mesh gets bigger, you can actually just run this in serial: […] and it'll take longer, but that's fine - it's only done once. As you get bigger and bigger meshes, most HPCs usually have high-memory or visualization nodes. These are perfect for splitting meshes.
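A sketch of the splitting step itself, under the same assumptions about the binary and file names:

```bash
# Split into 600 pieces using 4 processes: each process loads the full mesh
# and writes out 600 / 4 = 150 of the partitions.
mpiexec -n 4 ./myapp-opt -i input.i --split-mesh 600 --split-file mesh

# The same thing in serial: slower, but it only has to be done once.
./myapp-opt -i input.i --split-mesh 600 --split-file mesh
```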
Kind of. But, when you get into meshes that are millions of elements... in general we recommend always pre-splitting so you don't even have to think about this problem.
Like I said above. When you don't pre-split but use a distributed mesh, the mesh is serialized (loaded fully) on all processes, partitioned on all processes, and the off-processor portion is then deleted. Thus, when you don't pre-split, at one point each process has the full mesh in memory. You can sometimes get away with this because this memory jump happens before adding systems/vectors/matrices/etc.
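For contrast, the non-pre-split route being described would look something like this (a sketch only; the file name is a placeholder, and `parallel_type = distributed` is the usual way to request a distributed mesh in the `[Mesh]` block):

```
[Mesh]
  type = FileMesh
  file = big_mesh.e
  # Distributed mesh without pre-splitting: every rank still reads and
  # partitions the full mesh during setup before dropping the remote pieces.
  parallel_type = distributed
[]
```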
Whether your mesh is replicated or distributed, vectors and matrices are always distributed. So the overhead for these should not be different.
As I asked above - can you share how many degrees of freedom you have in your problem? That would give us an idea of the usage.
-
As a follow-up question to this, could someone confirm exactly what the reported memory use in the output file is telling us? Below is an example of the reporting I am referring to: […] And: […]
-
The way this is done is by partitioning; for a given processor configuration (# procs), this process is equivalent regardless of whether the mesh is distributed or replicated. The partitioning assigns the nodal and elemental values in the mesh to a particular processor. Once this is done, it determines which processor owns which part of the vectors and matrices in the problem. When the vectors and matrices are allocated (regardless of distributed or replicated mesh), they are still distributed. If you do not need ghosting for a vector or matrix, each processor will only allocate the degrees of freedom that it owns. If you need some ghosting, you'll have a bit more allocation here in order to have space for the entries that you need but that other processors own.
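As a rough illustration with made-up numbers: a problem with 5 million dofs on 500 ranks and a well-balanced partition means each rank allocates about 5,000,000 / 500 = 10,000 entries of each vector for the dofs it owns, plus a comparatively small number of ghosted entries for dofs it needs along partition boundaries.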
This is likely because you have more available memory on a compute node, no? This is a common approach. Say your compute nodes have 256 GB of memory apiece and 48 procs each. If you request a whole node, you get 256 / 48 ≈ 5.3 GB per proc. If you request the whole node but only use half the procs, you get 256 / 24 ≈ 10.6 GB per proc. That is, when you have a memory-constrained problem, we often request fewer processors but still request a whole node so that we have more memory available per process. Are you sure that you're not just resolving your problems because you have more memory available?
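A sketch of what that looks like in practice, assuming a Slurm scheduler (the node size, flags, and binary name here are assumptions for illustration, not something from this thread):

```bash
#!/bin/bash
# Take one whole 48-core / 256 GB node exclusively, but launch only 24 ranks,
# so each rank gets roughly 256 / 24 = 10.6 GB instead of 256 / 48 = 5.3 GB.
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=24
#SBATCH --exclusive

srun ./myapp-opt -i input.i
```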
This isn't the case. The matrices and vectors (aside from ghosting) really are allocated in a distributed sense. With a pre-split mesh, for a standard problem, there should be very little replicated data. However, there is a chance that you could be using an algorithm or some capability that doesn't support distribution and does some replication. Can you share what you're trying to run?
-
I have a couple of quite similar questions.