large scale run failure with gt4py cpu backend #93

@xyuan

Description

When we scale out the gt4py CPU backend using the pace/example test case up to 384 MPI ranks on Gaea, with or without OpenMP support, the run crashes with the following errors:

/ncrc/home1/Xingqiu.Yuan/miniconda3/envs/py3119/lib/python3.11/site-packages/gridtools_cpp/data/include/gridtools/stencil/cpu_kfirst.hpp(78): error: no instance of overloaded function "gridtools::sid::shift" matches the argument list
argument types are: (ptr_diff_t, gridtools::sid::default_stride, )
sid::shift(
^
/ncrc/home1/Xingqiu.Yuan/miniconda3/envs/py3119/lib/python3.11/site-packages/gridtools_cpp/data/include/gridtools/stencil/frontend/../../sid/concept.hpp(658): note: this candidate was rejected because at least one template argument could not be deduced
using concept_impl_::shift;

With OpenMP support, we get the following error:

/ncrc/home1/Xingqiu.Yuan/miniconda3/envs/py3119/lib/python3.11/site-packages/gridtools_cpp/data/include/gridtools/stencil/cpu_ifirst/loops.hpp(131): warning #16219: Some OpenMP processing was skipped to constrain compile time. Consider overriding limits (-qoverride-limits).
srun: error: c5n0890: task 90: Exited with exit code 1
srun: error: c5n1563: task 260: Killed
srun: error: c5n0890: task 111: Exited with exit code 1
srun: error: c5n1563: tasks 279,282: Killed
srun: error: c5n0890: task 98: Exited with exit code 1
srun: error: c5n1563: tasks 258,261,264,267,272,274,287,294,297: Killed
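
For triage, the srun lines above can be grouped by node and exit status. The following is a minimal sketch for that purpose only; it is not part of pace or gt4py, the names `expand_tasks` and `summarize_srun_errors` are hypothetical, and the handling of task ranges such as `4-6` is an assumption, since no range appears in this log.

```python
import re
from collections import defaultdict

# Matches lines such as:
#   srun: error: c5n0890: task 90: Exited with exit code 1
#   srun: error: c5n1563: tasks 279,282: Killed
SRUN_ERROR = re.compile(
    r"^srun: error: (?P<node>\S+): tasks? (?P<tasks>[\d,\-]+): (?P<status>.+)$"
)


def expand_tasks(spec):
    """Expand a task list like '258,261' into ints.

    Ranges such as '4-6' are also expanded (assumption: not seen in this log).
    """
    tasks = []
    for part in spec.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            tasks.extend(range(int(lo), int(hi) + 1))
        else:
            tasks.append(int(part))
    return tasks


def summarize_srun_errors(log_text):
    """Group failed task IDs by (node, status)."""
    summary = defaultdict(list)
    for line in log_text.splitlines():
        match = SRUN_ERROR.match(line.strip())
        if match:
            key = (match["node"], match["status"])
            summary[key].extend(expand_tasks(match["tasks"]))
    return {key: sorted(tasks) for key, tasks in summary.items()}


# The srun lines copied from the log above.
LOG = """\
srun: error: c5n0890: task 90: Exited with exit code 1
srun: error: c5n1563: task 260: Killed
srun: error: c5n0890: task 111: Exited with exit code 1
srun: error: c5n1563: tasks 279,282: Killed
srun: error: c5n0890: task 98: Exited with exit code 1
srun: error: c5n1563: tasks 258,261,264,267,272,274,287,294,297: Killed
"""

if __name__ == "__main__":
    for (node, status), tasks in summarize_srun_errors(LOG).items():
        print(f"{node}: {status}: tasks {tasks}")
```

Grouped this way, every task reported as "Exited with exit code 1" is on node c5n0890 and every task reported as "Killed" is on node c5n1563.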

However, the same case runs fine with the DaCe backend.

Describe the system environment, include:
The modules used for the test are:
(base) Xingqiu.Yuan@gaea56:/gpfs/f5/gfdl_f/scratch/Xingqiu.Yuan/pace> module list

Currently Loaded Modules:

  1) craype-x86-rome                         7) cray-mpich/8.1.25       13) TimeZoneEDT/default      19) uberftp/2_8      25) cray-netcdf/4.9.0.3
  2) craype-network-ofi                      8) cray-libsci/23.02.1.1   14) DefApps/default          20) gcp/2.3          26) intel-oneapi/2023.1.0
  3) perftools-base/23.03.0                  9) PrgEnv-intel/8.3.3      15) nccmp/1.9.0.1            21) hsm/1.3.0        27) boost/1.79.0
  4) xpmem/2.6.2-2.5_2.38__gd067c3f.shasta  10) cray-pmi/6.1.10         16) nco/5.0.1                22) perlbrew/5.28.0
  5) craype/2.7.20                          11) darshan-runtime/3.4.0   17) fre-nctools/2024.03      23) fre/bronx-22
  6) cray-dsmml/0.2.2                       12) CmrsEnv/default         18) gridcf-gct/6.2.20220524  24) cray-hdf5/1.12.2.3

When we switch to the GCC compiler, we get a similar error.

Labels: bug
