ndkeen (Contributor) commented Oct 31, 2025

For pm-cpu nvidia builds, set the environment variable NVCOMPILER_TERM=trace so that failures in DEBUG builds report more useful information, such as a stack trace.

[bfb]
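
To try the same setting by hand outside the machine files, a minimal sketch follows. It assumes the only thing needed is the environment variable named above; `nvidia_debug_env` is a hypothetical helper for illustration and is not part of E3SM or CIME.

```python
import os


def nvidia_debug_env(base_env, compiler):
    """Return a copy of base_env with NVCOMPILER_TERM=trace for nvidia builds.

    Hypothetical helper: it only mirrors the setting described in this PR.
    """
    env = dict(base_env)
    if compiler == "nvidia":
        # Value taken from the PR description.
        env["NVCOMPILER_TERM"] = "trace"
    return env


env = nvidia_debug_env(os.environ, "nvidia")
print(env["NVCOMPILER_TERM"])
```

The returned mapping can be passed as the `env` argument of `subprocess.run` when launching a DEBUG executable manually.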

@ndkeen ndkeen self-assigned this Oct 31, 2025
@ndkeen ndkeen added labels Oct 31, 2025: Machine Files; BFB (PR leaves answers BFB); pm-cpu (Perlmutter at NERSC, CPU-only nodes); nvidia (nvidia compiler, formerly PGI)
@peterdschwartz peterdschwartz self-requested a review November 5, 2025 15:55
ndkeen added a commit that referenced this pull request Nov 5, 2025
… (PR #7852)


ndkeen (Contributor, Author) commented Nov 5, 2025

merged to next

Note that with this change, failing DEBUG nvidia runs now produce output like the following (previously there was no trace):

  0: Error: floating point exception, floating point invalid operation
  0:    rax 0x000000000000000c, rbx 0x000000000f18be00, rcx 0x00000000000003fd
  0:    rdx 0x40862e42fefa39ef, rsp 0x00007fff9a613580, rbp 0x00007fff9a616130
  0:    rsi 0x0000000000000001, rdi 0x0000000000000001, r8  0x00000000000000ae
  0:    r9  0x0000000003e61c79, r10 0x0000000000000001, r11 0x00001492514ebc20
  0:    r12 0x00007fff9a618ca8, r13 0x00007fff9a618cc0, r14 0x00007fff9a618090
  0:    r15 0x0000000004357e68
  0:   /lib64/libpthread.so.0(+0x16910) [0x149256e2d910]
  0:   /opt/nvidia/hpc_sdk/Linux_x86_64/25.5/compilers/lib/libnvcpumath.so(__mth_i_dexp_avx2+0x168) [0x14924f383658]
  0:   /pscratch/sd/n/ndk/e3sm_scratch/pm-cpu/nexty-oct30/ERS_D.f09_f09.IELM.pm-cpu_nvidia.elm-koch_snowflake.cdash-traceenv/bld/e3sm.exe(initverticalmod_initvertical_: initverticalmod_initvertical_ at /global/cfs/cdirs/e3sm/ndk/repos/nexty-oct30/components/elm/src/main/initVerticalMod.F90:174) [0x112f2f6]
  0:   /pscratch/sd/n/ndk/e3sm_scratch/pm-cpu/nexty-oct30/ERS_D.f09_f09.IELM.pm-cpu_nvidia.elm-koch_snowflake.cdash-traceenv/bld/e3sm.exe(elm_instmod_elm_inst_biogeophys_: elm_instmod_elm_inst_biogeophys_ at /global/cfs/cdirs/e3sm/ndk/repos/nexty-oct30/components/elm/src/main/elm_instMod.F90:386) [0x1024507]
  0:   /pscratch/sd/n/ndk/e3sm_scratch/pm-cpu/nexty-oct30/ERS_D.f09_f09.IELM.pm-cpu_nvidia.elm-koch_snowflake.cdash-traceenv/bld/e3sm.exe(elm_initializemod_initialize2_: elm_initializemod_initialize2_ at /global/cfs/cdirs/e3sm/ndk/repos/nexty-oct30/components/elm/src/main/elm_initializeMod.F90:706) [0x101f41c]
  0:   /pscratch/sd/n/ndk/e3sm_scratch/pm-cpu/nexty-oct30/ERS_D.f09_f09.IELM.pm-cpu_nvidia.elm-koch_snowflake.cdash-traceenv/bld/e3sm.exe(lnd_comp_mct_lnd_init_mct_: lnd_comp_mct_lnd_init_mct_ at /global/cfs/cdirs/e3sm/ndk/repos/nexty-oct30/components/elm/src/cpl/lnd_comp_mct.F90:353) [0xf8ef2d]
  0:   /pscratch/sd/n/ndk/e3sm_scratch/pm-cpu/nexty-oct30/ERS_D.f09_f09.IELM.pm-cpu_nvidia.elm-koch_snowflake.cdash-traceenv/bld/e3sm.exe(component_mod_component_init_cc_: component_mod_component_init_cc_ at /global/cfs/cdirs/e3sm/ndk/repos/nexty-oct30/driver-mct/main/component_mod.F90:259) [0x848dc0]
  0:   /pscratch/sd/n/ndk/e3sm_scratch/pm-cpu/nexty-oct30/ERS_D.f09_f09.IELM.pm-cpu_nvidia.elm-koch_snowflake.cdash-traceenv/bld/e3sm.exe(cime_comp_mod_cime_init_: cime_comp_mod_cime_init_ at /global/cfs/cdirs/e3sm/ndk/repos/nexty-oct30/driver-mct/main/cime_comp_mod.F90:1524) [0x80fcc8]
  0:   /pscratch/sd/n/ndk/e3sm_scratch/pm-cpu/nexty-oct30/ERS_D.f09_f09.IELM.pm-cpu_nvidia.elm-koch_snowflake.cdash-traceenv/bld/e3sm.exe(MAIN_: MAIN_ at /global/cfs/cdirs/e3sm/ndk/repos/nexty-oct30/driver-mct/main/cime_driver.F90:124) [0x845d4b]
  0:   /pscratch/sd/n/ndk/e3sm_scratch/pm-cpu/nexty-oct30/ERS_D.f09_f09.IELM.pm-cpu_nvidia.elm-koch_snowflake.cdash-traceenv/bld/e3sm.exe(main+0x31) [0x800931]
  0:   /lib64/libc.so.6(__libc_start_main+0xef) [0x149250c3e1fd]
  0:   /pscratch/sd/n/ndk/e3sm_scratch/pm-cpu/nexty-oct30/ERS_D.f09_f09.IELM.pm-cpu_nvidia.elm-koch_snowflake.cdash-traceenv/bld/e3sm.exe(_start: _start at /home/abuild/rpmbuild/BUILD/glibc-2.31/csu/../sysdeps/x86_64/start.S:122) [0x80081a]
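
A trace like this is long, so it can help to reduce it to the frames that carry source information. The following is a rough sketch, assuming frames keep the `(function: function at file:line) [address]` shape seen in the output in this comment; `unwrap` and `source_frames` are hypothetical helper names, and the sample lines are copied from that output. It first joins lines that were wrapped with a trailing backslash, in case the trace was copied from a wrapped terminal.

```python
import re

# Frames with source info look like:
#   ...e3sm.exe(func_: func_ at /path/to/file.F90:174) [0x112f2f6]
FRAME_RE = re.compile(
    r"\((?P<func>\w+): \w+ at (?P<path>[^\s:]+):(?P<line>\d+)\) \[0x[0-9a-f]+\]"
)


def unwrap(text):
    """Join lines that were wrapped with a trailing backslash."""
    return text.replace("\\\n", "")


def source_frames(trace):
    """Return (function, file name, line number) for each frame with source info."""
    frames = []
    for match in FRAME_RE.finditer(unwrap(trace)):
        file_name = match.group("path").rsplit("/", 1)[-1]
        frames.append((match.group("func"), file_name, int(match.group("line"))))
    return frames


# Three lines taken from the trace in this comment (one of them wrapped).
SAMPLE = r"""
  0:   /opt/nvidia/hpc_sdk/Linux_x86_64/25.5/compilers/lib/libnvcpumath.so(__mth_i_dexp_avx2+0x168) [0x14924f383658]
  0:   /pscratch/sd/n/ndk/e3sm_scratch/pm-cpu/nexty-oct30/ERS_D.f09_f09.IELM.pm-cpu_nvidia.elm-koch_snowflake.cdash-traceenv/bld/e3sm.exe(initverticalmod_initvertical_: initverticalmod_initvertical_ at /global/cfs/cdirs/e3sm/ndk/repos/nexty-oct30/components/elm/src/main/i\
nitVerticalMod.F90:174) [0x112f2f6]
  0:   /pscratch/sd/n/ndk/e3sm_scratch/pm-cpu/nexty-oct30/ERS_D.f09_f09.IELM.pm-cpu_nvidia.elm-koch_snowflake.cdash-traceenv/bld/e3sm.exe(MAIN_: MAIN_ at /global/cfs/cdirs/e3sm/ndk/repos/nexty-oct30/driver-mct/main/cime_driver.F90:124) [0x845d4b]
"""

for func, file_name, line in source_frames(SAMPLE):
    print(f"{file_name}:{line} in {func}")
```

Library frames without file and line information, such as the `libnvcpumath.so` entry, are skipped by design, so the first frame reported is the first one in E3SM source.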

@ndkeen ndkeen merged commit 77fc39f into master Nov 6, 2025
6 checks passed
@ndkeen ndkeen deleted the ndk/machinefiles/pm-cpu-nvidia-trace-env-var branch November 6, 2025 17:46