@bishtgautam
Contributor

@bishtgautam bishtgautam commented Aug 4, 2025

This adds infrastructure in ELM to use MOAB-based domain decomposition.
This is the first development step for supporting lateral connectivity across
ELM grid cells via MOAB.

[BFB]

@bishtgautam bishtgautam marked this pull request as draft August 4, 2025 21:51
@bishtgautam bishtgautam added the Land and MOAB (Involves the MOAB library) labels Aug 5, 2025
@bishtgautam bishtgautam force-pushed the bishtgautam/lnd/elm-moab-mesh branch from d105042 to 6056650 Compare August 5, 2025 21:46
@bishtgautam bishtgautam force-pushed the bishtgautam/lnd/elm-moab-mesh branch from 6056650 to 64fde3f Compare August 20, 2025 19:48
@bishtgautam bishtgautam changed the title from "[WIP] Adds MOAB-based domain decomposition for ELM" to "Adds MOAB-based domain decomposition for ELM" Aug 20, 2025
@bishtgautam bishtgautam marked this pull request as ready for review August 20, 2025 19:49
@rljacob rljacob requested a review from vijaysm August 21, 2025 21:34
@vijaysm
Contributor

vijaysm commented Aug 21, 2025

@bishtgautam, does it include all of your refactored changes as well? I know that it requires a corresponding change in the iMOAB interface. I need to submit a MOAB PR for that.

@bishtgautam
Contributor Author

@vijaysm Yes, this PR includes all of our combined changes.

@rljacob
Member

rljacob commented Sep 18, 2025

@bishtgautam can you rebase to handle the conflict?

@bishtgautam bishtgautam force-pushed the bishtgautam/lnd/elm-moab-mesh branch from 64fde3f to 369193e Compare September 21, 2025 00:00
@bishtgautam
Contributor Author

I have rebased the branch.

Contributor

@vijaysm vijaysm left a comment

@bishtgautam Sorry about the delay in getting to this. Mostly just minor comments that can be easily addressed. I'll update the iMOABF routine and submit a PR for it in MOAB.

end do

! Set the data in MOAB tag
ierr = iMOAB_SetIntTagStorage(mlndghostid, tagname, moab_gcell%num_ghosted * numcomp, entity_type(1), data_int)
Contributor

Does num_ghosted include owned + ghost? Ideally, we would set only owned data and let the synchronization call update the ghost-layer data. Setting data on ghost elements has no effect here, but is it more convenient to do it this way?

Contributor Author

  • num_ghosted does correspond to owned + ghost cells
  • Only values corresponding to locally owned cells are being filled in the array data_int. The index of those locally owned cells is being obtained from moab_gcell%elm2moab(ln) at L2468.
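
A minimal sketch of the owned-only fill described in the two points above, for readers without the diff open. It assumes made-up sizes and a made-up elm2moab mapping; only the names num_ghosted, numcomp, data_int, and elm2moab come from the discussion, and the actual iMOAB call is left as a comment.

```fortran
program fill_owned_only
  implicit none
  ! Hypothetical sizes: 3 owned cells plus 2 ghost cells on this task.
  integer, parameter :: num_owned = 3, num_ghost = 2, numcomp = 1
  integer, parameter :: num_ghosted = num_owned + num_ghost
  ! elm2moab(ln): position of ELM's ln-th owned cell in MOAB's local
  ! (owned + ghost) ordering. The values are made up for illustration.
  integer, parameter :: elm2moab(num_owned) = [2, 4, 5]
  integer :: data_int(num_ghosted * numcomp)
  integer :: ln

  data_int = 0                           ! ghost slots keep this placeholder
  do ln = 1, num_owned
     data_int(elm2moab(ln)) = 100 + ln   ! stand-in for the real field value
  end do

  ! In the PR the full array would be handed to iMOAB_SetIntTagStorage here;
  ! ghost entries are expected to be refreshed by a later synchronization.
  print '(5(i0,1x))', data_int
end program fill_owned_only
```

The array is sized for owned + ghost cells, but only the slots reached through the mapping receive real values, which matches the behavior described in the reply.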

if (ierr > 0) call endrun('Error: getting values failed')

! Get number of ghost quantities at all subgrid categories
procinfo%ncells_ghost = moab_gcell%num_ghost
Contributor

And num_ghost counts only ghost cells? IMO num_ghosted could use a different name for clarity.

Contributor Author

  • Yes, num_ghost corresponds to ghost (= non-owned) cells.

I'm open to suggestions for a new name that reflects "owned + ghost" cells.

Contributor

In MOAB, we call these n_owned, n_ghost, and n_local, where getting all the local entities on any task returns n_owned + n_ghost. But you should do what is appropriate here for ELM. I made the suggestion because num_ghosted and num_ghost are too alike to tell apart.

Contributor Author

I like your suggested change and have created a new issue to fix it in the future (#7745).


! get data about cell neighbors
num_neighbor = max_num_neighbor
ierr = iMOAB_GetNeighborElements(mlndghostid, g - 1, num_neighbor, neighbor_id) ! convert g from 1- to 0-based index
Contributor

I am adding a reminder to myself to update the iMOAB F90 interface in MOAB.

! let us create the point-cloud MOAB mesh that the coupler needs
call init_moab_land(bounds, LNDID)
! now let us create that MOAB app that represents the full ELM mesh
! call init_moab_land_internal(bounds, LNDID)
Contributor

I originally had this here, only to realize later that the mesh needs to be instantiated much earlier in ELM. So you can probably remove L321-322. You can also remove the init_moab_land_internal implementation from lnd_comp_mct.F90.

! !DESCRIPTION:
!
!
use seq_flds_mod , only : seq_flds_l2x_fields, seq_flds_x2l_fields
Contributor

You can delete this entire subroutine, since it has been refactored and is now exposed in elm_moab_initialize and MOABGridType.

peterdschwartz added a commit that referenced this pull request Oct 10, 2025
@peterdschwartz
Contributor

merged to next

@rljacob
Member

rljacob commented Oct 12, 2025

This is causing the MOAB coupled test on integration to fail. An update to local MOAB installs is needed before this can continue. @vijaysm

/gpfs/fs1/home/e3smtest/jenkins/workspace/ACME_chrysalis_next/E3SM/components/elm/src/data_types/MOABGridType.F90(306): error #6634: The shape matching rules of actual arguments and dummy arguments have been violated.   [NEIGHBOR_ID]
       ierr = iMOAB_GetNeighborElements(mlndghostid, g - 1, num_neighbor, neighbor_id) ! convert g from 1- to 0-based index
--------------------------------------------------------------------------^
compilation aborted for /gpfs/fs1/home/e3smtest/jenkins/workspace/ACME_chrysalis_next/E3SM/components/elm/src/data_types/MOABGridType.F90 (code 1)

@bishtgautam
Contributor Author

@vijaysm, I believe this is the patch that is needed for MOAB

cat ~/packages/installations/moab_master/patch.txt
commit 11dcaa17abe793bf7d66776cd9235b4ec39d931a
Author: Gautam Bisht <[email protected]>
Date:   Thu Oct 9 11:55:22 2025 -0700

    Changes scalars to pointers
    
    The get subroutines are expected to return pointers instead of
    scalar values.

diff --git a/src/iMOABF.F90 b/src/iMOABF.F90
index bc1f17a52..197ce936a 100644
--- a/src/iMOABF.F90
+++ b/src/iMOABF.F90
@@ -310,7 +310,7 @@ module iMOAB
         integer(c_int), intent(in) :: pid
         integer(c_int), intent(in) :: local_index
         integer(c_int), intent(out) :: num_adjacent_elements
-        integer(c_int), intent(out) :: adjacent_element_IDs
+        integer(c_int), intent(out) :: adjacent_element_IDs(*)
       end function iMOAB_GetNeighborElements
 
       integer(c_int) function iMOAB_GetNeighborVertices(pid, local_index, num_adjacent_vertices, adjacent_vertex_IDs) &
@@ -319,7 +319,7 @@ module iMOAB
         integer(c_int), intent(in) :: pid
         integer(c_int), intent(in) :: local_index
         integer(c_int), intent(out) :: num_adjacent_vertices
-        integer(c_int), intent(out) :: adjacent_vertex_IDs
+        integer(c_int), intent(out) :: adjacent_vertex_IDs(*)
       end function iMOAB_GetNeighborVertices
 
       integer(c_int) function iMOAB_SetGlobalInfo(pid, num_global_verts, num_global_elems) bind(C, name='iMOAB_SetGlobalInfo')
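
Why the one-character change matters can be shown with a self-contained sketch. The module and routine below are hypothetical stand-ins written for illustration, not the iMOAB interface; the point is only that an array actual argument needs an array dummy such as ids(*).

```fortran
module neighbor_demo
  use iso_c_binding, only : c_int
  implicit none
contains
  ! Hypothetical stand-in for a routine that returns a list of IDs.
  ! The assumed-size dummy ids(*) accepts an array actual argument.
  integer(c_int) function get_neighbors(local_index, num_ids, ids) bind(C)
    integer(c_int), intent(in)    :: local_index
    integer(c_int), intent(inout) :: num_ids   ! in: capacity, out: count
    integer(c_int), intent(out)   :: ids(*)
    integer(c_int) :: i
    num_ids = min(num_ids, 4_c_int)
    do i = 1, num_ids
       ids(i) = local_index + i    ! made-up neighbor IDs
    end do
    get_neighbors = 0
  end function get_neighbors
end module neighbor_demo

program call_demo
  use iso_c_binding, only : c_int
  use neighbor_demo, only : get_neighbors
  implicit none
  integer(c_int), parameter :: max_num_neighbor = 8
  integer(c_int) :: neighbor_id(max_num_neighbor), num_neighbor, ierr
  integer(c_int) :: g

  g = 5
  num_neighbor = max_num_neighbor
  ! Passing the array neighbor_id is valid because ids is declared ids(*);
  ! with a scalar dummy, compilers reject this call (Intel: error #6634).
  ierr = get_neighbors(g - 1, num_neighbor, neighbor_id)
  if (ierr /= 0) error stop 'get_neighbors failed'
  print '(a,i0,a,4(1x,i0))', 'count=', num_neighbor, ' ids=', neighbor_id(1:num_neighbor)
end program call_demo
```

Changing ids(*) back to a scalar declaration in this sketch should reproduce the kind of shape-matching error reported from the Chrysalis build.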

@rljacob
Member

rljacob commented Oct 13, 2025

Already taken care of: https://bitbucket.org/fathomteam/moab/pull-requests/751

@vijaysm
Contributor

vijaysm commented Oct 13, 2025

I've already merged the changes to MOAB master and reinstalled on Chrysalis. Haven't updated the install elsewhere just yet. Will do it tonight.

peterdschwartz added a commit that referenced this pull request Oct 13, 2025
@bishtgautam bishtgautam force-pushed the bishtgautam/lnd/elm-moab-mesh branch from b22c40e to 735a66b Compare October 21, 2025 21:09
@bishtgautam
Contributor Author

@peterdschwartz This is ready for reintegration, pending an available slot. Thanks

peterdschwartz added a commit that referenced this pull request Oct 22, 2025
@peterdschwartz
Contributor

on next

!
use elm_varctl , only : iulog ! for messages and domain file name
!
character(1024), intent(in) :: meshfile
Contributor

This length has to be consistent with elm_varctl definition for fatmlndfrc variable, which is defined as SHR_KIND_CL. The Intel build on Perlmutter is complaining that the length is inconsistent, and hence the build fails (https://my.cdash.org/tests/311892328).

Contributor

@peterdschwartz peterdschwartz Oct 24, 2025

I think it would be better to use an assumed-length string for dummy arguments:

character(len=*), intent(in) :: meshfile

Will test on Chrysalis.
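
A minimal sketch of why the assumed-length form avoids the mismatch: the dummy takes its length from each caller. The routine name load_mesh and both file names are hypothetical, and the two declared lengths are chosen only to show that either works.

```fortran
module mesh_demo
  implicit none
contains
  ! Assumed-length dummy: takes its length from whatever the caller passes.
  subroutine load_mesh(meshfile)
    character(len=*), intent(in) :: meshfile
    print '(a,i0,2a)', 'declared length ', len(meshfile), ', file: ', trim(meshfile)
  end subroutine load_mesh
end module mesh_demo

program len_demo
  use mesh_demo, only : load_mesh
  implicit none
  ! Two callers with different declared lengths (values are hypothetical).
  character(len=256)  :: short_name = 'domain_a.nc'
  character(len=1024) :: long_name  = 'domain_b.nc'
  call load_mesh(short_name)
  call load_mesh(long_name)
end program len_demo
```

With a fixed character(1024) dummy, the first call would pass a shorter actual argument than the dummy declares, which is the kind of inconsistency the Intel build flagged.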

Contributor Author

@peterdschwartz, I didn't see this message from you, and I already pushed a fix exactly as you proposed.

peterdschwartz added a commit that referenced this pull request Oct 24, 2025
@peterdschwartz
Contributor

remerged to next

@peterdschwartz
Contributor

The ERS_Vmoab.ne4pg2_oQU480.WCYCL1850NS test has been failing by hitting the time limit, and it seems to hang during initialization:

87: libpnetcdf.so.3.0  0000155554B3C62C  for__signal_handl     Unknown  Unknown
87: libpthread-2.28.s  000015554CBB1CF0  Unknown               Unknown  Unknown
87: hmca_bcol_basesmu  0000155538670144  hmca_bcol_basesmu     Unknown  Unknown
87: libhcoll.so.1.0.9  0000155542400707  Unknown               Unknown  Unknown
87: libmpi.so.40.30.6  0000155545AC6EEB  mca_coll_hcoll_al     Unknown  Unknown
87: libmpi.so.40.30.6  0000155545A69E05  PMPI_Allreduce        Unknown  Unknown
87: e3sm.exe           00000000061A5521  Unknown               Unknown  Unknown
87: e3sm.exe           0000000003876E93  moabgridtype_mp_e         187  MOABGridType.F90
87: e3sm.exe           000000000335ABF8  elm_initializemod         206  elm_initializeMod.F90
87: e3sm.exe           0000000003319D96  lnd_comp_mct_mp_l         296  lnd_comp_mct.F90
87: e3sm.exe           0000000000471578  component_mod_mp_         270  component_mod.F90
60: [chr-0498:1421954:0:1421954] Caught signal 11 (Segmentation fault: address not mapped

Here's the end of the lnd log file (no timesteps recorded):

 Attempting to read global land mask from
 /lcrc/group/e3sm/data/inputdata/share/domains/domain.lnd.ne4pg2_oQU480.200527.n
 c
 (GETFIL): attempting to find local file domain.lnd.ne4pg2_oQU480.200527.nc
 (GETFIL): using
 /lcrc/group/e3sm/data/inputdata/share/domains/domain.lnd.ne4pg2_oQU480.200527.n
 c
 Opened existing file
 /lcrc/group/e3sm/data/inputdata/share/domains/domain.lnd.ne4pg2_oQU480.200527.n
 c         106
 lat/lon grid flag (isgrid2d) is  F
 ncd_inqvid: variable LANDMASK is not on dataset

 register MOAB application:MOAB_ELM_GHOSTED^@, id=   538747256

 elm_moab_load_grid_file(): reading mesh file:
 /lcrc/group/e3sm/data/inputdata/share/domains/domain.lnd.ne4pg2_oQU480.200527.n
 c
 elm_moab_load_grid_file(): generating            1  ghost layers
~

@bishtgautam
Contributor Author

@peterdschwartz, Thanks for providing the error trace.

@vijaysm, does this look like a MOAB issue?

@vijaysm
Contributor

vijaysm commented Oct 27, 2025

@peterdschwartz Yes, this is a MOAB issue, and I'm working on a fix. There is something unusual happening: a scalar Allreduce call is stalling, possibly due to a task exiting early. I was able to replicate this issue on Perlmutter on Friday, but it cleared after I reinstalled MOAB with an updated environment. I used create_newcase here instead of launching the test. But I see that Chrysalis is failing as well, so perhaps there is something deeper. I will debug it to determine the cause of the regression. We merged some warning fixes recently in MOAB, and I'll verify whether it changed the underlying workflow in any way.

@vijaysm
Contributor

vijaysm commented Oct 27, 2025

@bishtgautam @peterdschwartz @rljacob This looks like a MOAB regression, actually. Reverting to an older hash on MOAB master makes the run work cleanly on Chrysalis and Perlmutter. I'll git bisect MOAB and figure out the issue today.

@bishtgautam
Contributor Author

@vijaysm, Thanks for the update.

integer :: ierr

topodim = 2 ! topological dimension = 2: manifold mesh on the sphere
bridgedim = 1 ! use vertices = 0 as the bridge (other options: edges = 1)
Contributor

@bishtgautam In #45953557090a3be59f1803fc18f110ce2907087c, I seem to have changed bridgedim from 0 to 1. My original reasoning was to eliminate corner ghost elements on partition boundaries (a minor gain in the bigger scheme of things), but using bridgedim=1 means that we expect the meshes to have edges defined. Reverting to bridgedim=0 fixes all issues. Can you please verify this change and retest? You do not need to rebuild MOAB for this to work.
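
The "corner" elements mentioned above can be illustrated on a toy grid. This sketch does not call MOAB; it assumes a hypothetical 3 x 3 grid of quad cells and counts the neighbors of the center cell when adjacency is bridged through shared vertices versus shared edges.

```fortran
program bridge_demo
  implicit none
  integer, parameter :: n = 3          ! toy n x n grid of quad cells
  integer :: i, j, di, dj, by_vertex, by_edge

  i = 2; j = 2                         ! the center cell
  by_vertex = 0; by_edge = 0
  do dj = -1, 1
     do di = -1, 1
        if (di == 0 .and. dj == 0) cycle          ! skip the cell itself
        if (i + di < 1 .or. i + di > n) cycle     ! outside the grid
        if (j + dj < 1 .or. j + dj > n) cycle
        by_vertex = by_vertex + 1                 ! shares at least a corner
        if (di == 0 .or. dj == 0) by_edge = by_edge + 1   ! shares a full edge
     end do
  end do
  print '(a,i0)', 'neighbors via shared vertex (bridge dim 0): ', by_vertex
  print '(a,i0)', 'neighbors via shared edge   (bridge dim 1): ', by_edge
end program bridge_demo
```

The center cell has 8 vertex-bridged neighbors and 4 edge-bridged ones; the 4 diagonal cells are what an edge bridge drops, at the cost of requiring edges to exist in the mesh.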

@bishtgautam
Contributor Author

@peterdschwartz could you please merge this PR again to next? thanks

peterdschwartz added a commit that referenced this pull request Oct 28, 2025
@peterdschwartz
Contributor

remerged to next

@peterdschwartz
Contributor

peterdschwartz commented Oct 29, 2025

Passed on Chrysalis, but there is a build error on pm-cpu_intel: @bishtgautam @vijaysm

CMake Error at CMakeLists.txt:65 (find_package):
  By not providing "FindMOAB.cmake" in CMAKE_MODULE_PATH this project has
  asked CMake to find a package configuration file provided by "MOAB", but
  CMake did not find one.

  Could not find a package configuration file provided by "MOAB" with any of
  the following names:

    MOABConfig.cmake
    moab-config.cmake

  Add the installation prefix of "MOAB" to CMAKE_PREFIX_PATH or set
  "MOAB_DIR" to a directory containing one of the above files.  If "MOAB"
  provides a separate development package or SDK, be sure it has been
  installed.
ERROR: CMake Error at CMakeLists.txt:65 (find_package):
  By not providing "FindMOAB.cmake" in CMAKE_MODULE_PATH this project has
  asked CMake to find a package configuration file provided by "MOAB", but
  CMake did not find one.

  Could not find a package configuration file provided by "MOAB" with any of
  the following names:

    MOABConfig.cmake
    moab-config.cmake

  Add the installation prefix of "MOAB" to CMAKE_PREFIX_PATH or set
  "MOAB_DIR" to a directory containing one of the above files.  If "MOAB"
  provides a separate development package or SDK, be sure it has been
  installed.

@vijaysm
Contributor

vijaysm commented Oct 29, 2025

@peterdschwartz I don't understand exactly what that error means. My create_test runs worked fine, and @bishtgautam confirmed it too. I haven't changed the installation since yesterday afternoon, but it looks like the environment is messed up. Is there any way to restart or redo that job on pm-cpu? Or do we have to wait until tomorrow?

@peterdschwartz peterdschwartz merged commit f29cfa4 into master Oct 29, 2025
6 checks passed
@peterdschwartz peterdschwartz deleted the bishtgautam/lnd/elm-moab-mesh branch October 29, 2025 18:43
@peterdschwartz
Contributor

The issue was determined to be with Perlmutter rather than this PR. Merged to master.
