Use ieee intrinsics to avoid trapping ieee_inexact signals #7867

peterdschwartz · 2025-11-07T18:01:32Z

Potentially due to a compiler bug in nvhpc, ELM halts on ieee_inexact exceptions when the -Ktrap=fp flag is set.

peterdschwartz · 2025-11-07T18:02:27Z

Going to perform sanity checks on Chrysalis.

ndkeen · 2025-11-07T18:39:27Z

Noting a reproducer: ERS_D.f09_f09.IELM.pm-cpu_nvidia.elm-koch_snowflake

As it looks to only be a problem with nvidia compiler, it might be good to put this test under

#ifdef CPRNVIDIA

?

peterdschwartz · 2025-11-10T14:24:40Z

Adding ifdefs is unnecessary: trapping inexact is always off for all compilers and machines, including pm-cpu_nvidia. This workaround appears necessary due to a bug with nvidia's math runtime. If someone wanted to turn on traps for inexact, the correct way would be to use the ieee intrinsics around the code of interest rather than compile flags.
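As a minimal sketch of what "using the ieee intrinsics around the code of interest" could look like: save the current halting mode for ieee_inexact, turn it off for the region, and restore it afterwards. The program name, the r8 kind definition, nlev, and the scalez/zecoeff values below are placeholders for illustration, not the code in this PR.

```fortran
program scoped_inexact_halting
   ! Illustrative only: names and constants are hypothetical, not ELM's.
   use, intrinsic :: ieee_arithmetic
   implicit none
   integer, parameter :: r8 = selected_real_kind(12)
   integer, parameter :: nlev = 5
   real(r8), parameter :: scalez = 0.025_r8, zecoeff = 0.5_r8
   real(r8) :: z(nlev)
   logical :: saved_halt
   integer :: j

   ! Only touch the halting mode if the processor supports controlling it.
   if (ieee_support_halting(ieee_inexact)) then
      call ieee_get_halting_mode(ieee_inexact, saved_halt)   ! remember caller's setting
      call ieee_set_halting_mode(ieee_inexact, .false.)      ! do not trap inexact here
   end if

   ! Code of interest: exp() is expected to raise the inexact flag.
   do j = 1, nlev
      z(j) = scalez*(exp(zecoeff*(real(j, r8) - 0.5_r8)) - 1._r8)
   end do

   ! Restore whatever halting mode was in effect before the region.
   if (ieee_support_halting(ieee_inexact)) then
      call ieee_set_halting_mode(ieee_inexact, saved_halt)
   end if

   print *, z
end program scoped_inexact_halting
```

Restoring the saved mode keeps the change local to the region, so code that runs afterwards sees the same halting behavior it had before.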

ndkeen · 2025-11-10T19:35:59Z

We are unable to reproduce this behavior with a simple example.

I tried simply:

       do j = 1, nlevgrnd
          exponent_arg = zecoeff * (j-0.5_r8)
          IF ( exponent_arg > 700.0_r8 ) THEN
             exponent_arg = 700.0_r8
          END IF
          !write(*,'(a,i8,es20.10,es20.10)') " ndk j, exponent_arg, scalez=", j, exponent_arg, scalez
          !write(*,'(a,i8,es20.10)') " ndk j, exp(exponent_arg)", j, exp(exponent_arg)
          zsoi(j) = scalez*(exp(exponent_arg)-1._r8)

          !ndk zsoi(j) = scalez*(exp(zecoeff*(j-0.5_r8))-1._r8)    !node depths
       enddo

in both places where we see the error and the test passes.

So I'm not sure what's best to do here

rljacob · 2025-11-10T19:50:28Z

components/elm/src/biogeophys/SoilStateType.F90

    if (.not. readvar ) then
-       !    Variable ZSOI not found, use the ELM parameters.
+       if (ieee_support_halting(ieee_inexact)) then
+          call ieee_set_flag(ieee_all,.false.)


This line turns off ALL floating point trapping, which would override our DEBUG settings. This would have to be wrapped in an #ifdef CPRNVIDIA if you really want this.

rljacob · 2025-11-10T19:52:21Z

components/elm/src/biogeophys/SoilStateType.F90

-       !    Variable ZSOI not found, use the ELM parameters.
+       if (ieee_support_halting(ieee_inexact)) then
+          call ieee_set_flag(ieee_all,.false.)
+          call ieee_set_halting_mode(ieee_inexact, .false.)


This will turn off ieee_inexact trapping for everything that runs after this, making it a global setting, not just for the land (unless the land is running on its own tasks). If we want to set these things globally, it should be done in the driver. Or you should set it back to "on".

peterdschwartz · 2025-11-10

ieee_inexact trapping is already off for everything. None of our code could work if that were not the case.

peterdschwartz · 2025-11-10

This is essentially a no-op except for nvidia on Perlmutter, which I can only hazard is a bug in the compiler or runtime.

rljacob · 2025-11-10

Oh, because ieee_support_halting(ieee_inexact) will be False everywhere except NVIDIA?

peterdschwartz · 2025-11-10

It's even false with nvidia! From the nvidia docs, our current flag only captures inv, divz, ovf. inexact is really never used because all transcendental functions are inexact. I have verified this by calling ieee_get_halting_mode(ieee_inexact, halt), which is .false. wherever I call it.

The special flag -Ktrap=none is used to preserve FPEs during compilation without unmasking any of them at runtime. The inv, divz, and ovf flags are often the most interesting, as they signify abnormal floating-point behavior in the program. These can be enabled with the useful shorthand -Ktrap=fp.

It seems to be a bug in the code gen or the AVX2 exp runtime. Found a similar issue of random inexact being tripped in 25.9: https://forums.developer.nvidia.com/t/nvfortran-25-9-spurious-floating-point-exception/346604

peterdschwartz · 2025-11-10

Sorry, it's not the support halting but the actual halting flag that is false.
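For readers following the distinction drawn above, a small sketch that prints both quantities side by side: whether the processor supports control of halting on inexact, and whether halting on inexact is currently enabled. The program and variable names are hypothetical, and the values printed will depend on the compiler and flags used.

```fortran
program halting_support_vs_mode
   ! Illustrative only: contrasts "can halting be controlled" with "is halting on".
   use, intrinsic :: ieee_arithmetic
   implicit none
   logical :: can_halt, does_halt

   ! Support inquiry: does the processor allow control of halting for inexact?
   can_halt = ieee_support_halting(ieee_inexact)

   ! Mode inquiry: is halting on inexact actually enabled right now?
   does_halt = .false.
   if (can_halt) call ieee_get_halting_mode(ieee_inexact, does_halt)

   print *, 'ieee_support_halting(ieee_inexact) = ', can_halt
   print *, 'halting mode for ieee_inexact      = ', does_halt
end program halting_support_vs_mode
```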

peterdschwartz · 2025-11-10T20:29:26Z

@rljacob For completeness, here are the debugging statements I tested to confirm what's happening (to the best of my knowledge at least)

! TOP OF SUBROUTINE
    block
      logical :: halt_inexact

      call ieee_get_halting_mode(ieee_inexact, halt_inexact)
      print *, "halt_inexact AT THE TOP: ",halt_inexact

    end block
......
    block
      logical :: inv, divz, ovf, unf, inx
      call ieee_set_flag(ieee_all, .false.)
      do j = 1, nlevgrnd
         zsoi(j) = scalez*(exp(zecoeff*(dble(j)-0.5_r8))-1._r8)    !node depths
         call ieee_get_flag(ieee_invalid,         inv)
         call ieee_get_flag(ieee_divide_by_zero,  divz)
         call ieee_get_flag(ieee_overflow,        ovf)
         call ieee_get_flag(ieee_underflow,       unf)
         call ieee_get_flag(ieee_inexact,         inx)
         write(*,*) 'Flags after exp: inv=',inv,' divz=',divz,' ovf=',ovf,' unf=',unf,' inx=',inx
      enddo
    end block

output:

halt_inexact AT THE TOP:   F
 Flags after exp: inv=  F  divz=  F  ovf=  F  unf=  F  inx=  T
 Flags after exp: inv=  F  divz=  F  ovf=  F  unf=  F  inx=  T
 Flags after exp: inv=  F  divz=  F  ovf=  F  unf=  F  inx=  T
 Flags after exp: inv=  F  divz=  F  ovf=  F  unf=  F  inx=  T
 Flags after exp: inv=  F  divz=  F  ovf=  F  unf=  F  inx=  T
 Flags after exp: inv=  F  divz=  F  ovf=  F  unf=  F  inx=  T
 Flags after exp: inv=  F  divz=  F  ovf=  F  unf=  F  inx=  T
 Flags after exp: inv=  F  divz=  F  ovf=  F  unf=  F  inx=  T
 Flags after exp: inv=  F  divz=  F  ovf=  F  unf=  F  inx=  T
 Flags after exp: inv=  F  divz=  F  ovf=  F  unf=  F  inx=  T
 Flags after exp: inv=  F  divz=  F  ovf=  F  unf=  F  inx=  T
 Flags after exp: inv=  F  divz=  F  ovf=  F  unf=  F  inx=  T
 Flags after exp: inv=  F  divz=  F  ovf=  F  unf=  F  inx=  T
 Flags after exp: inv=  F  divz=  F  ovf=  F  unf=  F  inx=  T
 Flags after exp: inv=  F  divz=  F  ovf=  F  unf=  F  inx=  T

ndkeen · 2025-11-12T17:10:19Z

What are we trying to show regarding the inexact flag?

muller-login02% cat ieee-nvidia-inexact.F90
! ftn -fpp -fpe0 ieee-nvidia-inexact.F90
program test

   use, intrinsic :: ieee_arithmetic  ! use this with ieee_usual
   !use, intrinsic :: ieee_exceptions ! use this for IEEE_DIVIDE_BY_ZERO
   implicit none
   logical :: halt_inexact, inv, divz, ovf, unf, inx
   integer :: j
   integer, parameter :: nlevgrnd=6
   real(8) :: zsoi(nlevgrnd), scalez, zecoeff

   scalez=0.25
   zecoeff=0.25

   call ieee_get_flag(ieee_invalid,         inv)
   call ieee_get_flag(ieee_divide_by_zero,  divz)
   call ieee_get_flag(ieee_overflow,        ovf)
   call ieee_get_flag(ieee_underflow,       unf)
   call ieee_get_flag(ieee_inexact,         inx)
   write(*,*) 'Flags initial: inv=',inv,' divz=',divz,' ovf=',ovf,' unf=',unf,' inx=',inx

   call ieee_get_halting_mode(ieee_inexact, halt_inexact)
   print *, "halt_inexact AT THE TOP: ",halt_inexact


   call ieee_set_flag(ieee_all, .false.)
   do j = 1, nlevgrnd
      zsoi(j) = scalez*(exp(zecoeff*(dble(j)-0.5))-1.0)    !node depths
      call ieee_get_flag(ieee_invalid,         inv)
      call ieee_get_flag(ieee_divide_by_zero,  divz)
      call ieee_get_flag(ieee_overflow,        ovf)
      call ieee_get_flag(ieee_underflow,       unf)
      call ieee_get_flag(ieee_inexact,         inx)
      write(*,*) 'Flags after exp: inv=',inv,' divz=',divz,' ovf=',ovf,' unf=',unf,' inx=',inx
   enddo

end program test

muller-login02% cat ieee-build.sh 
#!/usr/bin/env bash

source /usr/share/lmod/lmod/init/bash
module load cpu

echo "gnu"
module -q load PrgEnv-gnu
ftn ieee-nvidia-inexact.F90
./a.out > o.gnu.txt

echo "gnu with -ffpe-trap=invalid,zero,overflow"
module -q load PrgEnv-gnu
ftn -g -Wall -fbacktrace -fcheck=bounds -ffpe-trap=invalid,zero,overflow ieee-nvidia-inexact.F90
./a.out > o.gnu-trap.txt



echo "intel"
module -q load PrgEnv-intel
ftn -diag-disable=10448 ieee-nvidia-inexact.F90
./a.out > o.intel.txt

echo "intel with -fpe0"
module -q load PrgEnv-intel
ftn -O0 -g -check uninit -check bounds -check pointers -fpe0 -check noarg_temp_created -init=snan,arrays -diag-disable=10448 ieee-nvidia-inexact.F90
./a.out > o.intel-trap.txt



echo "nvidia"
module -q load PrgEnv-nvidia
ftn -Wl,-z,noexecstack ieee-nvidia-inexact.F90
./a.out > o.nvida.txt

echo "nvidia with -Ktrap=fp"
module -q load PrgEnv-nvidia
ftn -O0 -g -Ktrap=fp -Mbounds -Kieee -Wl,-z,noexecstack ieee-nvidia-inexact.F90
./a.out > o.nvidia-trap.txt

muller-login02% ./ieee-build.sh 
gnu
gnu with -ffpe-trap=invalid,zero,overflow
intel
intel with -fpe0
nvidia
nvidia with -Ktrap=fp
1.334u 0.493s 0:01.85 98.3% 0pf+0w


muller-login02% grep Flag o.* | column -t
o.gnu-trap.txt:     Flags  initial:  inv=  F     divz=  F      ovf=  F     unf=  F     inx=  F     
o.gnu-trap.txt:     Flags  after     exp:  inv=  F      divz=  F     ovf=  F     unf=  F     inx=  T
o.gnu-trap.txt:     Flags  after     exp:  inv=  F      divz=  F     ovf=  F     unf=  F     inx=  T
o.gnu-trap.txt:     Flags  after     exp:  inv=  F      divz=  F     ovf=  F     unf=  F     inx=  T
o.gnu-trap.txt:     Flags  after     exp:  inv=  F      divz=  F     ovf=  F     unf=  F     inx=  T
o.gnu-trap.txt:     Flags  after     exp:  inv=  F      divz=  F     ovf=  F     unf=  F     inx=  T
o.gnu-trap.txt:     Flags  after     exp:  inv=  F      divz=  F     ovf=  F     unf=  F     inx=  T
o.gnu.txt:          Flags  initial:  inv=  F     divz=  F      ovf=  F     unf=  F     inx=  F     
o.gnu.txt:          Flags  after     exp:  inv=  F      divz=  F     ovf=  F     unf=  F     inx=  T
o.gnu.txt:          Flags  after     exp:  inv=  F      divz=  F     ovf=  F     unf=  F     inx=  T
o.gnu.txt:          Flags  after     exp:  inv=  F      divz=  F     ovf=  F     unf=  F     inx=  T
o.gnu.txt:          Flags  after     exp:  inv=  F      divz=  F     ovf=  F     unf=  F     inx=  T
o.gnu.txt:          Flags  after     exp:  inv=  F      divz=  F     ovf=  F     unf=  F     inx=  T
o.gnu.txt:          Flags  after     exp:  inv=  F      divz=  F     ovf=  F     unf=  F     inx=  T
o.intel-trap.txt:   Flags  initial:  inv=  F     divz=  F      ovf=  F     unf=  F     inx=  F     
o.intel-trap.txt:   Flags  after     exp:  inv=  F      divz=  F     ovf=  F     unf=  F     inx=  T
o.intel-trap.txt:   Flags  after     exp:  inv=  F      divz=  F     ovf=  F     unf=  F     inx=  T
o.intel-trap.txt:   Flags  after     exp:  inv=  F      divz=  F     ovf=  F     unf=  F     inx=  T
o.intel-trap.txt:   Flags  after     exp:  inv=  F      divz=  F     ovf=  F     unf=  F     inx=  T
o.intel-trap.txt:   Flags  after     exp:  inv=  F      divz=  F     ovf=  F     unf=  F     inx=  T
o.intel-trap.txt:   Flags  after     exp:  inv=  F      divz=  F     ovf=  F     unf=  F     inx=  T
o.intel.txt:        Flags  initial:  inv=  F     divz=  F      ovf=  F     unf=  F     inx=  F     
o.intel.txt:        Flags  after     exp:  inv=  F      divz=  F     ovf=  F     unf=  F     inx=  T
o.intel.txt:        Flags  after     exp:  inv=  F      divz=  F     ovf=  F     unf=  F     inx=  T
o.intel.txt:        Flags  after     exp:  inv=  F      divz=  F     ovf=  F     unf=  F     inx=  T
o.intel.txt:        Flags  after     exp:  inv=  F      divz=  F     ovf=  F     unf=  F     inx=  T
o.intel.txt:        Flags  after     exp:  inv=  F      divz=  F     ovf=  F     unf=  F     inx=  T
o.intel.txt:        Flags  after     exp:  inv=  F      divz=  F     ovf=  F     unf=  F     inx=  T
o.nvida.txt:        Flags  initial:  inv=  F     divz=  F      ovf=  F     unf=  F     inx=  F     
o.nvida.txt:        Flags  after     exp:  inv=  F      divz=  F     ovf=  F     unf=  F     inx=  T
o.nvida.txt:        Flags  after     exp:  inv=  F      divz=  F     ovf=  F     unf=  F     inx=  T
o.nvida.txt:        Flags  after     exp:  inv=  F      divz=  F     ovf=  F     unf=  F     inx=  T
o.nvida.txt:        Flags  after     exp:  inv=  F      divz=  F     ovf=  F     unf=  F     inx=  T
o.nvida.txt:        Flags  after     exp:  inv=  F      divz=  F     ovf=  F     unf=  F     inx=  T
o.nvida.txt:        Flags  after     exp:  inv=  F      divz=  F     ovf=  F     unf=  F     inx=  T
o.nvidia-trap.txt:  Flags  initial:  inv=  F     divz=  F      ovf=  F     unf=  F     inx=  F     
o.nvidia-trap.txt:  Flags  after     exp:  inv=  F      divz=  F     ovf=  F     unf=  F     inx=  T
o.nvidia-trap.txt:  Flags  after     exp:  inv=  F      divz=  F     ovf=  F     unf=  F     inx=  T
o.nvidia-trap.txt:  Flags  after     exp:  inv=  F      divz=  F     ovf=  F     unf=  F     inx=  T
o.nvidia-trap.txt:  Flags  after     exp:  inv=  F      divz=  F     ovf=  F     unf=  F     inx=  T
o.nvidia-trap.txt:  Flags  after     exp:  inv=  F      divz=  F     ovf=  F     unf=  F     inx=  T
o.nvidia-trap.txt:  Flags  after     exp:  inv=  F      divz=  F     ovf=  F     unf=  F     inx=  T

peterdschwartz · 2025-11-12T18:29:54Z

@ndkeen
o.gnu-trap.txt: Flags after exp: inv= F divz= F ovf= F unf= F inx= T

This shows that the only ieee flag being set by the exp is the ieee_inexact flag. This proves that the program isn't catching an ieee_invalid, ieee_overflow, or any of the others. Note that every call to exp (and any other transcendental math function) sets the ieee_inexact flag, which is why compilers do not trap it by default and some don't even allow trapping it on certain architectures.

You did not show the output from this line:

   call ieee_get_halting_mode(ieee_inexact, halt_inexact)
   print *, "halt_inexact AT THE TOP: ",halt_inexact

But if you did, that would tell you whether the program is set to halt (i.e. trap) on ieee_inexact. You should find that none of those compilers with those flags do; if they did, the program would halt with a SIGFPE.

ndkeen · 2025-11-12T18:37:05Z

muller-login02% grep halt o.* | column -t
o.gnu-trap.txt:     halt_inexact  AT  THE  TOP:  F
o.gnu.txt:          halt_inexact  AT  THE  TOP:  F
o.intel-trap.txt:   halt_inexact  AT  THE  TOP:  F
o.intel.txt:        halt_inexact  AT  THE  TOP:  F
o.nvida.txt:        halt_inexact  AT  THE  TOP:  F
o.nvidia-trap.txt:  halt_inexact  AT  THE  TOP:  F

You are correct, none have it.

But while it's odd that inexact is showing T, it's not clear why nvidia would behave differently than the others.
I mean, can we modify this test to show that?

peterdschwartz · 2025-11-12T18:44:28Z

It's not odd to show inexact = T after an exp. Based on this thread https://forums.developer.nvidia.com/t/nvfortran-25-9-spurious-floating-point-exception/346604

Someone was able to make a small reproducer with 25.9 and with -O1 flag. It may be worth simply copying that code and seeing if we get the same results. But Matt Colgrove confirms that it shouldn't be happening and is a compiler bug.

ndkeen · 2025-11-12T21:45:12Z

OK, using the test code provided there, I was able to get a signal. If I build with debug flags and -O2, I can get a FP exception (only with the nvidia compiler). However, with e3sm, we aren't building with -O2. Also, for the test problem, I can get the error to go away (as suggested) with -Mnoinline, which is actually a reasonable thing to add to our e3sm DEBUG flags. I tried this, but I still get the error (although at a different location). I then saw that certain libs (outside of e3sm.bldlog) are for some reason not using all of the fortran debug flags specified. When I add them to be consistent across all fortran builds, I still get the error.

  0: Error: floating point exception, floating point invalid operation
  0:    rax 0x0000000000000080, rbx 0x00007ffdcc751cc0, rcx 0x0000000000000007
  0:    rdx 0x0000000000000000, rsp 0x00007ffdcc751b10, rbp 0x0000000000000000
  0:    rsi 0x0000000012bb0e98, rdi 0x0000000012bb09c0, r8  0x0000000012bb0e60
  0:    r9  0x0000000000000003, r10 0x0000000012a31010, r11 0x0000000000000006
  0:    r12 0x3ff0000000000000, r13 0x00007ffdcc752750, r14 0x00007ffdcc751b58
  0:    r15 0x00007ffdcc751b70
  0:   /lib64/libpthread.so.0(+0x16910) [0x1499ad66c910]
  0:   /opt/cray/pe/hdf5-parallel/1.14.3.1/nvidia/23.3/lib/libhdf5_parallel_nvidia.so.310(H5T__init_native_float_types+0xc34) [0x1499ac8bbbb4]
  0:   /opt/cray/pe/hdf5-parallel/1.14.3.1/nvidia/23.3/lib/libhdf5_parallel_nvidia.so.310(H5T_init+0x1a9) [0x1499ac80ef69]
  0:   /opt/cray/pe/hdf5-parallel/1.14.3.1/nvidia/23.3/lib/libhdf5_parallel_nvidia.so.310(H5VL_init_phase2+0x15d) [0x1499ac8de1dd]
  0:   /opt/cray/pe/hdf5-parallel/1.14.3.1/nvidia/23.3/lib/libhdf5_parallel_nvidia.so.310(H5_init_library+0x350) [0x1499ac587f90]
  0:   /opt/cray/pe/hdf5-parallel/1.14.3.1/nvidia/23.3/lib/libhdf5_parallel_nvidia.so.310(H5Eset_auto2+0x42) [0x1499ac6479c2]
  0:   /opt/cray/pe/netcdf-hdf5parallel/4.9.0.13/nvidia/23.3/lib/libnetcdf_parallel_nvidia.so.19(+0x9be1b) [0x1499acaabe1b]
  0:   /opt/cray/pe/netcdf-hdf5parallel/4.9.0.13/nvidia/23.3/lib/libnetcdf_parallel_nvidia.so.19(nc4_hdf5_initialize+0xa) [0x1499acaaab8a]
  0:   /opt/cray/pe/netcdf-hdf5parallel/4.9.0.13/nvidia/23.3/lib/libnetcdf_parallel_nvidia.so.19(NC_HDF5_initialize+0x23) [0x1499acab5f63]
  0:   /opt/cray/pe/netcdf-hdf5parallel/4.9.0.13/nvidia/23.3/lib/libnetcdf_parallel_nvidia.so.19(nc_initialize+0x58) [0x1499aca377d8]
  0:   /opt/cray/pe/netcdf-hdf5parallel/4.9.0.13/nvidia/23.3/lib/libnetcdf_parallel_nvidia.so.19(NC_open+0x103) [0x1499aca3bb43]
  0:   /opt/cray/pe/netcdf-hdf5parallel/4.9.0.13/nvidia/23.3/lib/libnetcdf_parallel_nvidia.so.19(nc_open+0x25) [0x1499aca3ade5]
  0:   /opt/cray/pe/netcdf-hdf5parallel/4.9.0.13/nvidia/23.3/lib/libnetcdff_parallel_nvidia.so.7(nf_open_+0x14c) [0x1499acc0d98c]
  0:   /opt/cray/pe/netcdf-hdf5parallel/4.9.0.13/nvidia/23.3/lib/libnetcdff_parallel_nvidia.so.7(netcdf_nf90_open_+0x154) [0x1499acc33e54]
  0:   /mscratch/sd/n/ndk/e3sm_scratch/muller-cpu/ERS_D.f09_f09.IELM.muller-cpu_nvidia.elm-koch_snowflake.pr7867-allfortflagssame/bld/e3sm.exe(shr_stream_mod_shr_stream_getcalendar_: shr_stream_mod_shr_stream_\
getcalendar_ at /global/cfs/cdirs/e3sm/ndk/repos/c31-nov6/share/streams/shr_stream_mod.F90:1949) [0x25ef010]
  0:   /mscratch/sd/n/ndk/e3sm_scratch/muller-cpu/ERS_D.f09_f09.IELM.muller-cpu_nvidia.elm-koch_snowflake.pr7867-allfortflagssame/bld/e3sm.exe(shr_stream_mod_shr_stream_init_: shr_stream_mod_shr_stream_init_ a\
t /global/cfs/cdirs/e3sm/ndk/repos/c31-nov6/share/streams/shr_stream_mod.F90:624) [0x25dfffb]
  0:   /mscratch/sd/n/ndk/e3sm_scratch/muller-cpu/ERS_D.f09_f09.IELM.muller-cpu_nvidia.elm-koch_snowflake.pr7867-allfortflagssame/bld/e3sm.exe(shr_strdata_mod_shr_strdata_readnml_: shr_strdata_mod_shr_strdata_\
readnml_ at /global/cfs/cdirs/e3sm/ndk/repos/c31-nov6/share/streams/shr_strdata_mod.F90:1429) [0x25a67cb]
  0:   /mscratch/sd/n/ndk/e3sm_scratch/muller-cpu/ERS_D.f09_f09.IELM.muller-cpu_nvidia.elm-koch_snowflake.pr7867-allfortflagssame/bld/e3sm.exe(datm_shr_mod_datm_shr_read_namelists_: datm_shr_mod_datm_shr_read_\
namelists_ at /global/cfs/cdirs/e3sm/ndk/repos/c31-nov6/components/data_comps/datm/src/datm_shr_mod.F90:147) [0xa5626f]
  0:   /mscratch/sd/n/ndk/e3sm_scratch/muller-cpu/ERS_D.f09_f09.IELM.muller-cpu_nvidia.elm-koch_snowflake.pr7867-allfortflagssame/bld/e3sm.exe(atm_comp_mct_atm_init_mct_: atm_comp_mct_atm_init_mct_ at /global/\
cfs/cdirs/e3sm/ndk/repos/c31-nov6/components/data_comps/datm/src/atm_comp_mct.F90:153) [0xa36d34]
  0:   /mscratch/sd/n/ndk/e3sm_scratch/muller-cpu/ERS_D.f09_f09.IELM.muller-cpu_nvidia.elm-koch_snowflake.pr7867-allfortflagssame/bld/e3sm.exe(component_mod_component_init_cc_: component_mod_component_init_cc_\
 at /global/cfs/cdirs/e3sm/ndk/repos/c31-nov6/driver-mct/main/component_mod.F90:259) [0x848b80]
  0:   /mscratch/sd/n/ndk/e3sm_scratch/muller-cpu/ERS_D.f09_f09.IELM.muller-cpu_nvidia.elm-koch_snowflake.pr7867-allfortflagssame/bld/e3sm.exe(cime_comp_mod_cime_init_: cime_comp_mod_cime_init_ at /global/cfs/\
cdirs/e3sm/ndk/repos/c31-nov6/driver-mct/main/cime_comp_mod.F90:1518) [0x80f9b1]
  0:   /mscratch/sd/n/ndk/e3sm_scratch/muller-cpu/ERS_D.f09_f09.IELM.muller-cpu_nvidia.elm-koch_snowflake.pr7867-allfortflagssame/bld/e3sm.exe(MAIN_: MAIN_ at /global/cfs/cdirs/e3sm/ndk/repos/c31-nov6/driver-m\
ct/main/cime_driver.F90:124) [0x845b0b]
  0:   /mscratch/sd/n/ndk/e3sm_scratch/muller-cpu/ERS_D.f09_f09.IELM.muller-cpu_nvidia.elm-koch_snowflake.pr7867-allfortflagssame/bld/e3sm.exe(main+0x31) [0x8006f1]
  0:   /lib64/libc.so.6(__libc_start_main+0xef) [0x1499a643e1fd]
  0:   /mscratch/sd/n/ndk/e3sm_scratch/muller-cpu/ERS_D.f09_f09.IELM.muller-cpu_nvidia.elm-koch_snowflake.pr7867-allfortflagssame/bld/e3sm.exe(_start: _start at /home/abuild/rpmbuild/BUILD/glibc-2.31/csu/../sy\
sdeps/x86_64/start.S:122) [0x8005da]
srun: error: nid001003: task 0: Exited with exit code 127

This might be pointing to a place where we are reading in a file and perhaps could be a clue.
Yep:

rCode = nf90_open(fileName,nf90_nowrite,fid)

After adding a write statement (and flush), I can see the file it is trying to read, which does not seem to have an issue.

(e3sm_unified_1.11.1_login) muller-login04% ncdump -k /global/cfs/cdirs/e3sm/inputdata/atm/datm7/atm_forcing.datm7.Qian.T62.c080727/Solar6Hrly/clmforc.Qian.c2006.T62.Solr.1972-01.nc
classic

but it sounds to me like the error must be happening in the nf90_open call itself, which, as we know from other issues I've debugged, does have a problem. Oh wait, I know what I did wrong, hold on. OK, fixed it: I had been experimenting with taking out the hack we put in place for this very issue, for a different reason. OK, with the LD_LIB hack in place (as it is in our repo), and then adding the same fortran flags we currently use for e3sm sources to also build other sources (csm, scorpio), I can get this test to avoid a fault and pass.

Just for completeness:

muller-login02% ./ieee-build.sh 
gnu
gnu with -ffpe-trap=invalid,zero,overflow
gnu with -O3 -ffpe-trap=invalid,zero,overflow
intel
intel with -fpe0
intel with -O3 -fpe0
nvidia
test-procs-mod.F90:
ieee-nvidia-inexact.F90:
nvidia with -O0 -g -Ktrap=fp -Mbounds -Kieee
test-procs-mod.F90:
ieee-nvidia-inexact.F90:
nvidia with -i4 -Mstack_arrays  -Mextend -byteswapio -Mflushz -Kieee -Mallocatable=03 -traceback  -O0 -g -Ktrap=fp -Mbounds -Kieee  -Mfree
test-procs-mod.F90:
ieee-nvidia-inexact.F90:
nvidia with -O2 -g -Ktrap=fp -Kieee
test-procs-mod.F90:
ieee-nvidia-inexact.F90:
test-procs-mod.F90:
ieee-nvidia-inexact.F90:
./ieee-build.sh: line 60: 2184875 Floating point exception./a.out > o.nvidia-opttrap.txt
nvidia with -O2 -g -Ktrap=fp -Mbounds -Kieee
test-procs-mod.F90:
ieee-nvidia-inexact.F90:
test-procs-mod.F90:
ieee-nvidia-inexact.F90:
2.493u 1.017s 0:04.02 87.0% 0pf+0w

With 

program test

   use, intrinsic :: ieee_arithmetic  ! use this with ieee_usual
   !use, intrinsic :: ieee_exceptions ! use this for IEEE_DIVIDE_BY_ZERO
   use procs
   implicit none
   logical :: halt_inexact, inv, divz, ovf, unf, inx
   integer :: j
   integer, parameter :: nlevgrnd=6
   real(8) :: zsoi(nlevgrnd), scalez, zecoeff
   double precision :: a(33),b

   scalez=0.25
   zecoeff=0.25

   call ieee_get_flag(ieee_invalid,         inv)
   call ieee_get_flag(ieee_divide_by_zero,  divz)
   call ieee_get_flag(ieee_overflow,        ovf)
   call ieee_get_flag(ieee_underflow,       unf)
   call ieee_get_flag(ieee_inexact,         inx)
   write(*,*) 'Flags initial: inv=',inv,' divz=',divz,' ovf=',ovf,' unf=',unf,' inx=',inx

   call ieee_get_halting_mode(ieee_inexact, halt_inexact)
   print *, "halt_inexact AT THE TOP: ",halt_inexact

   call ieee_set_flag(ieee_all, .false.)
   do j = 1, nlevgrnd
      zsoi(j) = scalez*(exp(zecoeff*(dble(j)-0.5))-1.0)    !node depths
      call ieee_get_flag(ieee_invalid,         inv)
      call ieee_get_flag(ieee_divide_by_zero,  divz)
      call ieee_get_flag(ieee_overflow,        ovf)
      call ieee_get_flag(ieee_underflow,       unf)
      call ieee_get_flag(ieee_inexact,         inx)
      write(*,*) 'Flags after exp: inv=',inv,' divz=',divz,' ovf=',ovf,' unf=',unf,' inx=',inx
   enddo

   a = &
      [0.3d0,7d11,7d9,4d0,1d0,1d3,1d0,0.2d0, &
      9d11,1d10,5d0,1d3,1d3,1d0,1d3,1d5,2d9, &
      5d0,1d0,8d-1,9d1,4d2,5d-1,5d-1,1d-1,8d6, &
      1d0,0d0,0d0,0d0,0d0,0d0,0d0]
   print*,'check 1'
   call flush(6)
   call sub1(a,b)
   print*,'check 2'
   call flush(6)
end program test

and the module as noted in the nvidia forum

peterdschwartz · 2025-11-13T15:30:02Z

I think all signs point to the issue being in the compiler, which is mistakenly turning on halting on inexact, rather than in the codebase, and I don't think it would be worth the effort to try to figure out what part of the code gen process is misbehaving.

There is really no use case for trapping ieee_inexact, so I'm happy with the solution I gave: it's clear, robust, and it ensures ieee_inexact is not trapped which is what we need.

If you want to keep exploring, go ahead but there's no reason to hold up this PR while you do that. There are potentially real ERS bugs with nvidia that need to be investigated.

Use ieee intrinsics to avoid trapping ieee_inexact signals

f5a16d1

peterdschwartz requested a review from ndkeen November 7, 2025 18:01

peterdschwartz added 2 commits November 7, 2025 14:52

fix c/p mistake

57a57af

add check in case compiler doesn't support inexact halting

1e05fe5

rljacob reviewed Nov 10, 2025
