E3SM-MMF run aborted with exit code 11 (segmentation fault) on an unsupported machine #7810
First question: why are you using such an old branch? There have been many important MMF-specific updates since 2021, such as the variance transport scheme that was officially added to the master branch in 2022 and fixed a concerning checkerboard noise issue (see papers here and here). After that, most changes were not critical bug fixes, since we diverted the development effort to other parts of the model, but a good amount of code refactoring still happened. I would encourage you to switch to a modern branch. Even if you are specifically trying to reproduce an old experiment, there is no capability in an old branch that is not preserved in the most recent master. Additionally, with the C++ version of the CRM (i.e. SAMxx) you will be able to run the CRM on GPUs for a significant speed-up, if your machine offers them.
Second question: which CRM configuration do you want to run, 2D or 3D? The 2D CRM can represent momentum feedback via the "ESMT" scheme (see paper here). If I remember correctly, I created that old INCITE branch to do a special run that used the ne45pg2 grid along with a large 3D CRM and explicit momentum feedback. If you want to do a 3D experiment, there's a nuanced caveat concerning the application of surface friction that you should know about. You can reach out to me on Slack for more details.
While we don't officially support setting the model up on new machines, I can provide some pointers. Since I was one of the main users and devs of E3SM-MMF in its heyday, I'm happy to spend a bit more effort to ensure it is still being used. I can provide quicker feedback if you find me on the E3SM Slack.
To verify that your machine was ported to E3SM correctly, try a simpler case like --compset X --res f19_g16.
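As a rough sketch of what such a port check could look like, the helper below only assembles the usual CIME command sequence (create_newcase, then case.setup, case.build and case.submit) for the X compset at f19_g16. The case name `x_port_check`, the machine name `my_machine`, and the `cime/scripts` location are placeholders assumed for illustration and need to match your own checkout and your config_machines.xml entry.

```python
import shlex


def smoke_test_commands(cime_scripts_dir, case_dir, machine, compiler=None):
    """Return shell commands for a minimal X-compset port check.

    All argument values are supplied by the caller; nothing here is
    specific to a particular machine or E3SM branch.
    """
    create = [
        f"{cime_scripts_dir}/create_newcase",
        "--case", case_dir,
        "--compset", "X",        # compset suggested in the reply above
        "--res", "f19_g16",      # resolution suggested in the reply above
        "--machine", machine,    # must match an entry in config_machines.xml
    ]
    if compiler:
        create += ["--compiler", compiler]
    # The three case.* scripts are meant to be run from inside the case directory.
    steps = [create, ["./case.setup"], ["./case.build"], ["./case.submit"]]
    return [shlex.join(step) for step in steps]


if __name__ == "__main__":
    # Hypothetical names: adjust the path, case name, and machine to your setup.
    for cmd in smoke_test_commands("cime/scripts", "x_port_check", "my_machine", "intel"):
        print(cmd)
```

If this small case also aborts within seconds, the problem is more likely in the machine port (MPI launcher, modules, batch settings) than in the MMF configuration itself.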
Dear E3SM team,
I'm attempting to run E3SM-MMF with momentum feedback using the branch
whannah/incite2021/momentum-transport-updated on an unsupported machine.
After customizing config_machines.xml and config_compilers.xml, I was able to successfully build the model.
However, when I submitted the run job to the SLURM system, the job aborted within a few seconds.
The only message in the log file is as follows:
There is no additional output from any component model logs (no atm.log, lnd.log, etc.).
Software environment:
Branch: whannah/incite2021/momentum-transport-updated
Compiler: intel/2021.3.0
MPI: intelmpi/2021.3.0
NetCDF: 4.7.4
According to Intel MPI’s documentation, exit code 11 corresponds to a segmentation fault, which may be caused by running out of memory.
However, even after switching to a large-memory partition (512 GB per node), the same error persists.
Could this be related to an MPI runtime issue or something going wrong during model initialization on unsupported machines?
Any suggestion on how to enable more detailed debugging or logging output would be greatly appreciated.
Thank you very much for your help!