Large mesh best practices and memory management #25053
-
Hi all, I have a few questions about how best to deal with large meshes, as I have been running into errors which I think are to do with processes running out of memory.

1. Mesh splitting: it seems that you need to fit the whole mesh into memory in order to perform the splitting operation in the first place? If the mesh doesn't fit, you can't split it to alleviate that problem? Is the idea that you can usually fit the whole mesh into memory when it is just the mesh (i.e., when performing the splitting), but during a simulation you need to allocate some of that memory to the nonlinear and aux systems, for instance, so you can no longer afford to store the whole mesh? A further question is whether the mesh splitting operation somehow assumes that you will have the same amount of memory when you run the simulation as you did during splitting?

2. What is the difference between pre-splitting the mesh and setting a distributed mesh? Up to now, I had thought that setting this […]

3. What about storing the nonlinear and auxiliary system variables?

In case it's useful, I usually get exit code 7 (bus error) when things don't work, and occasionally exit code 9 (killed) from mpirun. 5 million elements doesn't seem completely unreasonable, so any insight into where I might be going wrong is greatly appreciated. Thanks!
-
This is a very reasonable size. How many degrees of freedom are in the nonlinear and aux systems (you can see this in the header printout)?
That's correct. You gain two things here by each processor only needing to load its small part of the mesh (keep in mind that when loading from splits, you're forcing a distributed mesh): […]
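As a minimal sketch of what running from a pre-split mesh looks like, assuming the standard `--use-split`/`--split-file` options (the binary name, input file, and split-file name are placeholders):

```bash
# Each of the 600 ranks reads only its own piece of the pre-split mesh,
# so no rank ever holds the full mesh (a distributed mesh is forced).
mpiexec -n 600 ./myapp-opt -i input.i --use-split --split-file mesh
```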
You should split the mesh only using a few processes. That is: […] Each process will still load the full mesh (so 4 copies of the mesh), and each will be responsible for saving 150 partitions in this case. As your mesh gets bigger, you can actually just run this in serial: […] and it'll take longer, but that's fine - it's only done once. As you get bigger and bigger meshes, most HPCs usually have high-memory or visualization nodes. These are perfect for splitting meshes.
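A sketch of the splitting step itself, under the same assumptions about the binary and file names:

```bash
# Split into 600 pieces using 4 processes: each process loads the full mesh
# and writes out 600 / 4 = 150 of the partitions.
mpiexec -n 4 ./myapp-opt -i input.i --split-mesh 600 --split-file mesh

# The same thing in serial: slower, but it only has to be done once.
./myapp-opt -i input.i --split-mesh 600 --split-file mesh
```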
Kind of. But, when you get into meshes that are millions of elements... in general we recommend always pre-splitting so you don't even have to think about this problem.
Like I said above. When you don't pre-split but use a distributed mesh, the mesh is serialized (loaded fully) on all processes, partitioned on all processes, and the off-processor portion is then deleted. Thus, when you don't pre-split, at one point each process has the full mesh in memory. You can sometimes get away with this because this memory jump happens before adding systems/vectors/matrices/etc.
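For contrast, the non-pre-split route being described would look something like this (a sketch only; the file name is a placeholder, and `parallel_type = distributed` is the usual way to request a distributed mesh in the `[Mesh]` block):

```
[Mesh]
  type = FileMesh
  file = big_mesh.e
  # Distributed mesh without pre-splitting: every rank still reads and
  # partitions the full mesh during setup before dropping the remote pieces.
  parallel_type = distributed
[]
```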
Whether your mesh is replicated or distributed, vectors and matrices are always distributed. So the overhead for these should not be different.
As I asked above - can you share how many degrees of freedom you have in your problem? That would give us an idea of the usage.
-
As a follow-up question to this, could someone confirm exactly what the reported memory use in the output file is telling us? Below is an example of the reporting I am referring to: […] And: […]
-
The way this is done is by partitioning; for a given processor configuration (# procs), this process is equivalent regardless of whether the mesh is distributed or replicated. The partitioning assigns the nodal and elemental values in the mesh to a particular processor. Once this is done, it determines which processor owns which part of the vectors and matrices in the problem. When the vectors and matrices are allocated (regardless of distributed or replicated mesh), they are still distributed. If you do not need ghosting for a vector or matrix, each processor will only allocate the degrees of freedom that it owns. If you need some ghosting, you'll have a bit more allocation here in order to have space for the entries that you need but that other processors own.
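As a rough illustration with made-up numbers: a problem with 5 million dofs on 500 ranks and a well-balanced partition means each rank allocates about 5,000,000 / 500 = 10,000 entries of each vector for the dofs it owns, plus a comparatively small number of ghosted entries for dofs it needs along partition boundaries.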
This is likely because you have more available memory on a compute node, no? This is a common approach. Say your compute nodes have 256 GB of memory apiece and 48 procs each. If you request a whole node, you get 256 / 48 ≈ 5.3 GB per proc. If you request the whole node but only use half the procs, you get 256 / 24 ≈ 10.6 GB per proc. That is, when you have a memory-constrained problem, we often request fewer processors but still request a whole node so that we have more memory available per process. Are you sure that you're not just resolving your problems because you have more memory available?
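A sketch of what that looks like in practice, assuming a Slurm scheduler (the node size, flags, and binary name here are assumptions for illustration, not something from this thread):

```bash
#!/bin/bash
# Take one whole 48-core / 256 GB node exclusively, but launch only 24 ranks,
# so each rank gets roughly 256 / 24 = 10.6 GB instead of 256 / 48 = 5.3 GB.
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=24
#SBATCH --exclusive

srun ./myapp-opt -i input.i
```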
This isn't the case. The matrices and vectors (aside from ghosting) really are allocated in a distributed sense. With a pre-split mesh, for a standard problem, there should be very little replicated data. However, there is a chance that you could be using an algorithm or some capability that doesn't support distribution and does some replication. Can you share what you're trying to run?
-
I have a couple of quite similar questions.