@peterdschwartz
Contributor

Potentially due to a compiler bug in nvhpc, ELM halts on ieee_inexact exceptions when built with -Ktrap=fp.

@peterdschwartz peterdschwartz requested a review from ndkeen November 7, 2025 18:01
@peterdschwartz
Contributor Author

Going to perform sanity checks on Chrysalis.

@ndkeen
Contributor

ndkeen commented Nov 7, 2025

Noting a reproducer: ERS_D.f09_f09.IELM.pm-cpu_nvidia.elm-koch_snowflake

As it looks to only be a problem with nvidia compiler, it might be good to put this test under

#ifdef CPRNVIDIA

?

@peterdschwartz
Contributor Author

Adding ifdefs is unnecessary: trapping inexact is always off for all compilers and machines, including pm-cpu_nvidia. This workaround appears necessary due to a bug with nvidia's math runtime. If someone wanted to turn on traps for inexact, the correct way would be to use the ieee intrinsics around the code of interest rather than compile flags.
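
A minimal sketch of the scoped approach mentioned above, assuming a compiler that provides the intrinsic ieee_arithmetic module. Halting on ieee_inexact is enabled only around one statement and the previous mode is restored afterwards. The program name and the variables x and y are illustrative and not taken from ELM.

```fortran
program scoped_inexact_trap
   use, intrinsic :: ieee_arithmetic
   implicit none
   logical :: can_halt, saved_halt
   real(8) :: x, y

   x = 0.5d0
   saved_halt = .false.
   can_halt = ieee_support_halting(ieee_inexact)

   if (can_halt) then
      ! Remember the current halting mode so it can be restored afterwards.
      call ieee_get_halting_mode(ieee_inexact, saved_halt)
      ! Clear any stale inexact flag, then trap only inside this region.
      call ieee_set_flag(ieee_inexact, .false.)
      call ieee_set_halting_mode(ieee_inexact, .true.)
   end if

   ! Code of interest: 0.5 + 0.5 is exactly representable, so no trap fires.
   y = x + x

   ! Restore the previous halting mode before any other code runs.
   if (can_halt) call ieee_set_halting_mode(ieee_inexact, saved_halt)

   print *, 'y = ', y
end program scoped_inexact_trap
```

Replacing the exact sum with an inexact operation (for example a call to exp) inside the guarded region would be expected to raise SIGFPE on platforms where halting on inexact is supported.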

@ndkeen
Contributor

ndkeen commented Nov 10, 2025

We are unable to reproduce this behavior with a simple example.

I tried simply:

       do j = 1, nlevgrnd
          exponent_arg = zecoeff * (j-0.5_r8)
          IF ( exponent_arg > 700.0_r8 ) THEN
             exponent_arg = 700.0_r8
          END IF
          !write(*,'(a,i8,es20.10,es20.10)') " ndk j, exponent_arg, scalez=", j, exponent_arg, scalez
          !write(*,'(a,i8,es20.10)') " ndk j, exp(exponent_arg)", j, exp(exponent_arg)
          zsoi(j) = scalez*(exp(exponent_arg)-1._r8)

          !ndk zsoi(j) = scalez*(exp(zecoeff*(j-0.5_r8))-1._r8)    !node depths
       enddo

in both places where we see the error, and the test passes.

So I'm not sure what's best to do here.

if (.not. readvar ) then
! Variable ZSOI not found, use the ELM parameters.
if (ieee_support_halting(ieee_inexact)) then
call ieee_set_flag(ieee_all,.false.)
Member

This line turns off ALL floating point trapping, which would override our DEBUG settings. This would have to be wrapped in an #ifdef CPRNVIDIA if you really want this.

! Variable ZSOI not found, use the ELM parameters.
if (ieee_support_halting(ieee_inexact)) then
call ieee_set_flag(ieee_all,.false.)
call ieee_set_halting_mode(ieee_inexact, .false.)
Member

This will turn off ieee_inexact trapping for everything that runs after this, making it a global setting, not just one for the land (unless the land is running on its own tasks). If we want to set these things globally, it should be done in the driver. Otherwise you should set it back to "on" afterwards.
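
One way to keep such a change local, sketched here under the assumption that the compiler implements ieee_get_status and ieee_set_status, is to save the whole floating-point status before touching the halting mode and restore it after the loop. The values of nlevgrnd, scalez and zecoeff are placeholders chosen for illustration, not the ELM parameters.

```fortran
program restore_fp_status
   use, intrinsic :: ieee_arithmetic
   implicit none
   integer, parameter :: r8 = selected_real_kind(12)
   integer, parameter :: nlevgrnd = 6          ! placeholder size
   type(ieee_status_type) :: saved_status
   real(r8) :: zsoi(nlevgrnd), scalez, zecoeff
   integer :: j

   scalez  = 0.25_r8                            ! placeholder value
   zecoeff = 0.25_r8                            ! placeholder value

   ! Save the floating-point status (flags, halting modes, rounding mode).
   call ieee_get_status(saved_status)

   ! Make sure inexact results cannot halt the program inside this region.
   if (ieee_support_halting(ieee_inexact)) then
      call ieee_set_halting_mode(ieee_inexact, .false.)
   end if

   do j = 1, nlevgrnd
      zsoi(j) = scalez*(exp(zecoeff*(real(j, r8) - 0.5_r8)) - 1._r8)   ! node depths
   end do

   ! Restore the saved status so later code sees the original settings.
   call ieee_set_status(saved_status)

   print *, zsoi
end program restore_fp_status
```

Restoring the saved status also resets the exception flags to their saved values, so an inexact flag raised inside the loop does not leak out of the region.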

Contributor Author

ieee_inexact trapping is already off for everything. None of our code could work if that were not the case.

Contributor Author

This is essentially a no-op except with nvidia on Perlmutter, which I can only guess is due to a bug in the compiler or runtime.

Member

Oh, because ieee_support_halting(ieee_inexact) will be False everywhere except NVIDIA?

Contributor Author

It's even false with nvidia! According to the nvidia docs (quoted below), our current flag only traps inv, divz, and ovf. Inexact is really never trapped because all transcendental functions are inexact. I have verified this by calling ieee_get_halting_mode(ieee_inexact, halt), which returns .false. wherever I call it.

The special flag -Ktrap=none is used to preserve FPEs during compilation without unmasking any of them at runtime.

The inv, divz, and ovf flags are often the most interesting, as they signify abnormal floating-point behavior in the program. These can be enabled with the useful shorthand -Ktrap=fp.

It seems to be a bug in the code generation or the AVX2 exp runtime. I found a similar report of spurious inexact exceptions being tripped in 25.9: https://forums.developer.nvidia.com/t/nvfortran-25-9-spurious-floating-point-exception/346604
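
A small standalone check of which exceptions are set to halt can be sketched as follows. The program and the helper subroutine report are hypothetical names used only for illustration; the sketch relies only on ieee_support_halting and ieee_get_halting_mode from the intrinsic ieee_arithmetic module. If the quoted documentation holds, a build with -Ktrap=fp would be expected to report halting for invalid, divide_by_zero and overflow, but not for inexact.

```fortran
program show_halting_modes
   use, intrinsic :: ieee_arithmetic
   implicit none

   ! Report the halting mode of each IEEE exception in turn.
   call report('invalid        ', ieee_invalid)
   call report('divide_by_zero ', ieee_divide_by_zero)
   call report('overflow       ', ieee_overflow)
   call report('underflow      ', ieee_underflow)
   call report('inexact        ', ieee_inexact)

contains

   ! Hypothetical helper: prints whether the given exception halts the program.
   subroutine report(name, flag)
      character(len=*), intent(in) :: name
      type(ieee_flag_type), intent(in) :: flag
      logical :: halting

      if (ieee_support_halting(flag)) then
         call ieee_get_halting_mode(flag, halting)
         print '(a,a,l1)', name, 'halting = ', halting
      else
         print '(a,a)', name, 'halting not supported'
      end if
   end subroutine report

end program show_halting_modes
```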

Contributor Author

Sorry, to correct myself: it is not ieee_support_halting but the actual halting mode that is false.

@peterdschwartz
Contributor Author

peterdschwartz commented Nov 10, 2025

@rljacob For completeness, here are the debugging statements I tested to confirm what's happening (to the best of my knowledge at least):

! TOP OF SUBROUTINE
    block
      logical :: halt_inexact

      call ieee_get_halting_mode(ieee_inexact, halt_inexact)
      print *, "halt_inexact AT THE TOP: ",halt_inexact

    end block
......
    block
      logical :: inv, divz, ovf, unf, inx
      call ieee_set_flag(ieee_all, .false.)
      do j = 1, nlevgrnd
         zsoi(j) = scalez*(exp(zecoeff*(dble(j)-0.5_r8))-1._r8)    !node depths
         call ieee_get_flag(ieee_invalid,         inv)
         call ieee_get_flag(ieee_divide_by_zero,  divz)
         call ieee_get_flag(ieee_overflow,        ovf)
         call ieee_get_flag(ieee_underflow,       unf)
         call ieee_get_flag(ieee_inexact,         inx)
         write(*,*) 'Flags after exp: inv=',inv,' divz=',divz,' ovf=',ovf,' unf=',unf,' inx=',inx
      enddo
    end block

output:

halt_inexact AT THE TOP:   F
 Flags after exp: inv=  F  divz=  F  ovf=  F  unf=  F  inx=  T
 Flags after exp: inv=  F  divz=  F  ovf=  F  unf=  F  inx=  T
 Flags after exp: inv=  F  divz=  F  ovf=  F  unf=  F  inx=  T
 Flags after exp: inv=  F  divz=  F  ovf=  F  unf=  F  inx=  T
 Flags after exp: inv=  F  divz=  F  ovf=  F  unf=  F  inx=  T
 Flags after exp: inv=  F  divz=  F  ovf=  F  unf=  F  inx=  T
 Flags after exp: inv=  F  divz=  F  ovf=  F  unf=  F  inx=  T
 Flags after exp: inv=  F  divz=  F  ovf=  F  unf=  F  inx=  T
 Flags after exp: inv=  F  divz=  F  ovf=  F  unf=  F  inx=  T
 Flags after exp: inv=  F  divz=  F  ovf=  F  unf=  F  inx=  T
 Flags after exp: inv=  F  divz=  F  ovf=  F  unf=  F  inx=  T
 Flags after exp: inv=  F  divz=  F  ovf=  F  unf=  F  inx=  T
 Flags after exp: inv=  F  divz=  F  ovf=  F  unf=  F  inx=  T
 Flags after exp: inv=  F  divz=  F  ovf=  F  unf=  F  inx=  T
 Flags after exp: inv=  F  divz=  F  ovf=  F  unf=  F  inx=  T

@ndkeen
Contributor

ndkeen commented Nov 12, 2025

What are we trying to show regarding the inexact flag?

muller-login02% cat ieee-nvidia-inexact.F90
! ftn -fpp -fpe0 ieee-nvidia-inexact.F90
program test

   use, intrinsic :: ieee_arithmetic  ! use this with ieee_usual
   !use, intrinsic :: ieee_exceptions ! use this for IEEE_DIVIDE_BY_ZERO
   implicit none
   logical :: halt_inexact, inv, divz, ovf, unf, inx
   integer :: j
   integer, parameter :: nlevgrnd=6
   real(8) :: zsoi(nlevgrnd), scalez, zecoeff

   scalez=0.25
   zecoeff=0.25

   call ieee_get_flag(ieee_invalid,         inv)
   call ieee_get_flag(ieee_divide_by_zero,  divz)
   call ieee_get_flag(ieee_overflow,        ovf)
   call ieee_get_flag(ieee_underflow,       unf)
   call ieee_get_flag(ieee_inexact,         inx)
   write(*,*) 'Flags initial: inv=',inv,' divz=',divz,' ovf=',ovf,' unf=',unf,' inx=',inx

   call ieee_get_halting_mode(ieee_inexact, halt_inexact)
   print *, "halt_inexact AT THE TOP: ",halt_inexact


   call ieee_set_flag(ieee_all, .false.)
   do j = 1, nlevgrnd
      zsoi(j) = scalez*(exp(zecoeff*(dble(j)-0.5))-1.0)    !node depths
      call ieee_get_flag(ieee_invalid,         inv)
      call ieee_get_flag(ieee_divide_by_zero,  divz)
      call ieee_get_flag(ieee_overflow,        ovf)
      call ieee_get_flag(ieee_underflow,       unf)
      call ieee_get_flag(ieee_inexact,         inx)
      write(*,*) 'Flags after exp: inv=',inv,' divz=',divz,' ovf=',ovf,' unf=',unf,' inx=',inx
   enddo

end program test

muller-login02% cat ieee-build.sh 
#!/usr/bin/env bash

source /usr/share/lmod/lmod/init/bash
module load cpu

echo "gnu"
module -q load PrgEnv-gnu
ftn ieee-nvidia-inexact.F90
./a.out > o.gnu.txt

echo "gnu with -ffpe-trap=invalid,zero,overflow"
module -q load PrgEnv-gnu
ftn -g -Wall -fbacktrace -fcheck=bounds -ffpe-trap=invalid,zero,overflow ieee-nvidia-inexact.F90
./a.out > o.gnu-trap.txt



echo "intel"
module -q load PrgEnv-intel
ftn -diag-disable=10448 ieee-nvidia-inexact.F90
./a.out > o.intel.txt

echo "intel with -fpe0"
module -q load PrgEnv-intel
ftn -O0 -g -check uninit -check bounds -check pointers -fpe0 -check noarg_temp_created -init=snan,arrays -diag-disable=10448 ieee-nvidia-inexact.F90
./a.out > o.intel-trap.txt



echo "nvidia"
module -q load PrgEnv-nvidia
ftn -Wl,-z,noexecstack ieee-nvidia-inexact.F90
./a.out > o.nvida.txt

echo "nvidia with -Ktrap=fp"
module -q load PrgEnv-nvidia
ftn -O0 -g -Ktrap=fp -Mbounds -Kieee -Wl,-z,noexecstack ieee-nvidia-inexact.F90
./a.out > o.nvidia-trap.txt

muller-login02% ./ieee-build.sh 
gnu
gnu with -ffpe-trap=invalid,zero,overflow
intel
intel with -fpe0
nvidia
nvidia with -Ktrap=fp
1.334u 0.493s 0:01.85 98.3% 0pf+0w


muller-login02% grep Flag o.* | column -t
o.gnu-trap.txt:     Flags  initial:  inv=  F     divz=  F      ovf=  F     unf=  F     inx=  F     
o.gnu-trap.txt:     Flags  after     exp:  inv=  F      divz=  F     ovf=  F     unf=  F     inx=  T
o.gnu-trap.txt:     Flags  after     exp:  inv=  F      divz=  F     ovf=  F     unf=  F     inx=  T
o.gnu-trap.txt:     Flags  after     exp:  inv=  F      divz=  F     ovf=  F     unf=  F     inx=  T
o.gnu-trap.txt:     Flags  after     exp:  inv=  F      divz=  F     ovf=  F     unf=  F     inx=  T
o.gnu-trap.txt:     Flags  after     exp:  inv=  F      divz=  F     ovf=  F     unf=  F     inx=  T
o.gnu-trap.txt:     Flags  after     exp:  inv=  F      divz=  F     ovf=  F     unf=  F     inx=  T
o.gnu.txt:          Flags  initial:  inv=  F     divz=  F      ovf=  F     unf=  F     inx=  F     
o.gnu.txt:          Flags  after     exp:  inv=  F      divz=  F     ovf=  F     unf=  F     inx=  T
o.gnu.txt:          Flags  after     exp:  inv=  F      divz=  F     ovf=  F     unf=  F     inx=  T
o.gnu.txt:          Flags  after     exp:  inv=  F      divz=  F     ovf=  F     unf=  F     inx=  T
o.gnu.txt:          Flags  after     exp:  inv=  F      divz=  F     ovf=  F     unf=  F     inx=  T
o.gnu.txt:          Flags  after     exp:  inv=  F      divz=  F     ovf=  F     unf=  F     inx=  T
o.gnu.txt:          Flags  after     exp:  inv=  F      divz=  F     ovf=  F     unf=  F     inx=  T
o.intel-trap.txt:   Flags  initial:  inv=  F     divz=  F      ovf=  F     unf=  F     inx=  F     
o.intel-trap.txt:   Flags  after     exp:  inv=  F      divz=  F     ovf=  F     unf=  F     inx=  T
o.intel-trap.txt:   Flags  after     exp:  inv=  F      divz=  F     ovf=  F     unf=  F     inx=  T
o.intel-trap.txt:   Flags  after     exp:  inv=  F      divz=  F     ovf=  F     unf=  F     inx=  T
o.intel-trap.txt:   Flags  after     exp:  inv=  F      divz=  F     ovf=  F     unf=  F     inx=  T
o.intel-trap.txt:   Flags  after     exp:  inv=  F      divz=  F     ovf=  F     unf=  F     inx=  T
o.intel-trap.txt:   Flags  after     exp:  inv=  F      divz=  F     ovf=  F     unf=  F     inx=  T
o.intel.txt:        Flags  initial:  inv=  F     divz=  F      ovf=  F     unf=  F     inx=  F     
o.intel.txt:        Flags  after     exp:  inv=  F      divz=  F     ovf=  F     unf=  F     inx=  T
o.intel.txt:        Flags  after     exp:  inv=  F      divz=  F     ovf=  F     unf=  F     inx=  T
o.intel.txt:        Flags  after     exp:  inv=  F      divz=  F     ovf=  F     unf=  F     inx=  T
o.intel.txt:        Flags  after     exp:  inv=  F      divz=  F     ovf=  F     unf=  F     inx=  T
o.intel.txt:        Flags  after     exp:  inv=  F      divz=  F     ovf=  F     unf=  F     inx=  T
o.intel.txt:        Flags  after     exp:  inv=  F      divz=  F     ovf=  F     unf=  F     inx=  T
o.nvida.txt:        Flags  initial:  inv=  F     divz=  F      ovf=  F     unf=  F     inx=  F     
o.nvida.txt:        Flags  after     exp:  inv=  F      divz=  F     ovf=  F     unf=  F     inx=  T
o.nvida.txt:        Flags  after     exp:  inv=  F      divz=  F     ovf=  F     unf=  F     inx=  T
o.nvida.txt:        Flags  after     exp:  inv=  F      divz=  F     ovf=  F     unf=  F     inx=  T
o.nvida.txt:        Flags  after     exp:  inv=  F      divz=  F     ovf=  F     unf=  F     inx=  T
o.nvida.txt:        Flags  after     exp:  inv=  F      divz=  F     ovf=  F     unf=  F     inx=  T
o.nvida.txt:        Flags  after     exp:  inv=  F      divz=  F     ovf=  F     unf=  F     inx=  T
o.nvidia-trap.txt:  Flags  initial:  inv=  F     divz=  F      ovf=  F     unf=  F     inx=  F     
o.nvidia-trap.txt:  Flags  after     exp:  inv=  F      divz=  F     ovf=  F     unf=  F     inx=  T
o.nvidia-trap.txt:  Flags  after     exp:  inv=  F      divz=  F     ovf=  F     unf=  F     inx=  T
o.nvidia-trap.txt:  Flags  after     exp:  inv=  F      divz=  F     ovf=  F     unf=  F     inx=  T
o.nvidia-trap.txt:  Flags  after     exp:  inv=  F      divz=  F     ovf=  F     unf=  F     inx=  T
o.nvidia-trap.txt:  Flags  after     exp:  inv=  F      divz=  F     ovf=  F     unf=  F     inx=  T
o.nvidia-trap.txt:  Flags  after     exp:  inv=  F      divz=  F     ovf=  F     unf=  F     inx=  T

@peterdschwartz
Contributor Author

peterdschwartz commented Nov 12, 2025

@ndkeen
o.gnu-trap.txt: Flags after exp: inv= F divz= F ovf= F unf= F inx= T

This shows that the only IEEE flag raised by the exp is ieee_inexact, so the program isn't hitting ieee_invalid, ieee_overflow, or any of the others. Note that essentially every call to exp (or any other transcendental function) raises ieee_inexact, because the result is almost never exactly representable. That is why compilers do not trap it by default, and some don't even allow trapping it on certain architectures.

You did not show the output from this line:

   call ieee_get_halting_mode(ieee_inexact, halt_inexact)
   print *, "halt_inexact AT THE TOP: ",halt_inexact

If you did, it would tell you whether the program is set to halt (i.e. trap) on ieee_inexact. You should find that none of those compilers enables it with those flags; if one did, the program would halt with a SIGFPE.

@ndkeen
Contributor

ndkeen commented Nov 12, 2025

muller-login02% grep halt o.* | column -t
o.gnu-trap.txt:     halt_inexact  AT  THE  TOP:  F
o.gnu.txt:          halt_inexact  AT  THE  TOP:  F
o.intel-trap.txt:   halt_inexact  AT  THE  TOP:  F
o.intel.txt:        halt_inexact  AT  THE  TOP:  F
o.nvida.txt:        halt_inexact  AT  THE  TOP:  F
o.nvidia-trap.txt:  halt_inexact  AT  THE  TOP:  F

You are correct, none have it.

But while it's odd that inexact is showing T, it's still not clear why nvidia would behave differently from the others.
Can we modify this test to show the difference?

@peterdschwartz
Contributor Author

peterdschwartz commented Nov 12, 2025

It's not odd to see inexact = T after an exp. See this thread: https://forums.developer.nvidia.com/t/nvfortran-25-9-spurious-floating-point-exception/346604

Someone there was able to make a small reproducer with 25.9 and the -O1 flag. It may be worth simply copying that code and seeing if we get the same results. Matt Colgrove confirms in the thread that this shouldn't be happening and is a compiler bug.

@ndkeen
Contributor

ndkeen commented Nov 12, 2025

OK, using the test code provided there, I was able to get a signal. If I build with debug flags and -O2, I get a floating-point exception (only with the nvidia compiler). However, in e3sm we aren't building with -O2. Also, for the test problem, I can make the error go away (as suggested) with -Mnoinline, which is actually a reasonable thing to add to our e3sm DEBUG flags. I tried this in e3sm, but I still get the error (although at a different location). I then noticed that certain libs (outside of e3sm.bldlog) are for some reason not using all of the Fortran debug flags specified. When I add them so the flags are consistent across all Fortran builds, I still get the error.

  0: Error: floating point exception, floating point invalid operation
  0:    rax 0x0000000000000080, rbx 0x00007ffdcc751cc0, rcx 0x0000000000000007
  0:    rdx 0x0000000000000000, rsp 0x00007ffdcc751b10, rbp 0x0000000000000000
  0:    rsi 0x0000000012bb0e98, rdi 0x0000000012bb09c0, r8  0x0000000012bb0e60
  0:    r9  0x0000000000000003, r10 0x0000000012a31010, r11 0x0000000000000006
  0:    r12 0x3ff0000000000000, r13 0x00007ffdcc752750, r14 0x00007ffdcc751b58
  0:    r15 0x00007ffdcc751b70
  0:   /lib64/libpthread.so.0(+0x16910) [0x1499ad66c910]
  0:   /opt/cray/pe/hdf5-parallel/1.14.3.1/nvidia/23.3/lib/libhdf5_parallel_nvidia.so.310(H5T__init_native_float_types+0xc34) [0x1499ac8bbbb4]
  0:   /opt/cray/pe/hdf5-parallel/1.14.3.1/nvidia/23.3/lib/libhdf5_parallel_nvidia.so.310(H5T_init+0x1a9) [0x1499ac80ef69]
  0:   /opt/cray/pe/hdf5-parallel/1.14.3.1/nvidia/23.3/lib/libhdf5_parallel_nvidia.so.310(H5VL_init_phase2+0x15d) [0x1499ac8de1dd]
  0:   /opt/cray/pe/hdf5-parallel/1.14.3.1/nvidia/23.3/lib/libhdf5_parallel_nvidia.so.310(H5_init_library+0x350) [0x1499ac587f90]
  0:   /opt/cray/pe/hdf5-parallel/1.14.3.1/nvidia/23.3/lib/libhdf5_parallel_nvidia.so.310(H5Eset_auto2+0x42) [0x1499ac6479c2]
  0:   /opt/cray/pe/netcdf-hdf5parallel/4.9.0.13/nvidia/23.3/lib/libnetcdf_parallel_nvidia.so.19(+0x9be1b) [0x1499acaabe1b]
  0:   /opt/cray/pe/netcdf-hdf5parallel/4.9.0.13/nvidia/23.3/lib/libnetcdf_parallel_nvidia.so.19(nc4_hdf5_initialize+0xa) [0x1499acaaab8a]
  0:   /opt/cray/pe/netcdf-hdf5parallel/4.9.0.13/nvidia/23.3/lib/libnetcdf_parallel_nvidia.so.19(NC_HDF5_initialize+0x23) [0x1499acab5f63]
  0:   /opt/cray/pe/netcdf-hdf5parallel/4.9.0.13/nvidia/23.3/lib/libnetcdf_parallel_nvidia.so.19(nc_initialize+0x58) [0x1499aca377d8]
  0:   /opt/cray/pe/netcdf-hdf5parallel/4.9.0.13/nvidia/23.3/lib/libnetcdf_parallel_nvidia.so.19(NC_open+0x103) [0x1499aca3bb43]
  0:   /opt/cray/pe/netcdf-hdf5parallel/4.9.0.13/nvidia/23.3/lib/libnetcdf_parallel_nvidia.so.19(nc_open+0x25) [0x1499aca3ade5]
  0:   /opt/cray/pe/netcdf-hdf5parallel/4.9.0.13/nvidia/23.3/lib/libnetcdff_parallel_nvidia.so.7(nf_open_+0x14c) [0x1499acc0d98c]
  0:   /opt/cray/pe/netcdf-hdf5parallel/4.9.0.13/nvidia/23.3/lib/libnetcdff_parallel_nvidia.so.7(netcdf_nf90_open_+0x154) [0x1499acc33e54]
  0:   /mscratch/sd/n/ndk/e3sm_scratch/muller-cpu/ERS_D.f09_f09.IELM.muller-cpu_nvidia.elm-koch_snowflake.pr7867-allfortflagssame/bld/e3sm.exe(shr_stream_mod_shr_stream_getcalendar_: shr_stream_mod_shr_stream_\
getcalendar_ at /global/cfs/cdirs/e3sm/ndk/repos/c31-nov6/share/streams/shr_stream_mod.F90:1949) [0x25ef010]
  0:   /mscratch/sd/n/ndk/e3sm_scratch/muller-cpu/ERS_D.f09_f09.IELM.muller-cpu_nvidia.elm-koch_snowflake.pr7867-allfortflagssame/bld/e3sm.exe(shr_stream_mod_shr_stream_init_: shr_stream_mod_shr_stream_init_ a\
t /global/cfs/cdirs/e3sm/ndk/repos/c31-nov6/share/streams/shr_stream_mod.F90:624) [0x25dfffb]
  0:   /mscratch/sd/n/ndk/e3sm_scratch/muller-cpu/ERS_D.f09_f09.IELM.muller-cpu_nvidia.elm-koch_snowflake.pr7867-allfortflagssame/bld/e3sm.exe(shr_strdata_mod_shr_strdata_readnml_: shr_strdata_mod_shr_strdata_\
readnml_ at /global/cfs/cdirs/e3sm/ndk/repos/c31-nov6/share/streams/shr_strdata_mod.F90:1429) [0x25a67cb]
  0:   /mscratch/sd/n/ndk/e3sm_scratch/muller-cpu/ERS_D.f09_f09.IELM.muller-cpu_nvidia.elm-koch_snowflake.pr7867-allfortflagssame/bld/e3sm.exe(datm_shr_mod_datm_shr_read_namelists_: datm_shr_mod_datm_shr_read_\
namelists_ at /global/cfs/cdirs/e3sm/ndk/repos/c31-nov6/components/data_comps/datm/src/datm_shr_mod.F90:147) [0xa5626f]
  0:   /mscratch/sd/n/ndk/e3sm_scratch/muller-cpu/ERS_D.f09_f09.IELM.muller-cpu_nvidia.elm-koch_snowflake.pr7867-allfortflagssame/bld/e3sm.exe(atm_comp_mct_atm_init_mct_: atm_comp_mct_atm_init_mct_ at /global/\
cfs/cdirs/e3sm/ndk/repos/c31-nov6/components/data_comps/datm/src/atm_comp_mct.F90:153) [0xa36d34]
  0:   /mscratch/sd/n/ndk/e3sm_scratch/muller-cpu/ERS_D.f09_f09.IELM.muller-cpu_nvidia.elm-koch_snowflake.pr7867-allfortflagssame/bld/e3sm.exe(component_mod_component_init_cc_: component_mod_component_init_cc_\
 at /global/cfs/cdirs/e3sm/ndk/repos/c31-nov6/driver-mct/main/component_mod.F90:259) [0x848b80]
  0:   /mscratch/sd/n/ndk/e3sm_scratch/muller-cpu/ERS_D.f09_f09.IELM.muller-cpu_nvidia.elm-koch_snowflake.pr7867-allfortflagssame/bld/e3sm.exe(cime_comp_mod_cime_init_: cime_comp_mod_cime_init_ at /global/cfs/\
cdirs/e3sm/ndk/repos/c31-nov6/driver-mct/main/cime_comp_mod.F90:1518) [0x80f9b1]
  0:   /mscratch/sd/n/ndk/e3sm_scratch/muller-cpu/ERS_D.f09_f09.IELM.muller-cpu_nvidia.elm-koch_snowflake.pr7867-allfortflagssame/bld/e3sm.exe(MAIN_: MAIN_ at /global/cfs/cdirs/e3sm/ndk/repos/c31-nov6/driver-m\
ct/main/cime_driver.F90:124) [0x845b0b]
  0:   /mscratch/sd/n/ndk/e3sm_scratch/muller-cpu/ERS_D.f09_f09.IELM.muller-cpu_nvidia.elm-koch_snowflake.pr7867-allfortflagssame/bld/e3sm.exe(main+0x31) [0x8006f1]
  0:   /lib64/libc.so.6(__libc_start_main+0xef) [0x1499a643e1fd]
  0:   /mscratch/sd/n/ndk/e3sm_scratch/muller-cpu/ERS_D.f09_f09.IELM.muller-cpu_nvidia.elm-koch_snowflake.pr7867-allfortflagssame/bld/e3sm.exe(_start: _start at /home/abuild/rpmbuild/BUILD/glibc-2.31/csu/../sy\
sdeps/x86_64/start.S:122) [0x8005da]
srun: error: nid001003: task 0: Exited with exit code 127

This might be pointing to a place where we are reading in a file and perhaps could be a clue.
Yep:

rCode = nf90_open(fileName,nf90_nowrite,fid

After adding a write statement (and a flush), I can see the file it is trying to read, which does not seem to have any issue.

(e3sm_unified_1.11.1_login) muller-login04% ncdump -k /global/cfs/cdirs/e3sm/inputdata/atm/datm7/atm_forcing.datm7.Qian.T62.c080727/Solar6Hrly/clmforc.Qian.c2006.T62.Solr.1972-01.nc
classic

But it sounds to me like the error must be happening in the nf90_open call itself, which we know from other issues I've debugged does have a problem. Oh wait -- I know what I did wrong, hold on. OK, fixed it -- I had been experimenting with taking out the hack we put in place for this very issue, for a different reason. With the LD_LIB hack in place (as it is in our repo), and with the same Fortran flags we currently use for e3sm sources also applied to the other sources (csm, scorpio), I can get this test to avoid the fault and pass.

Just for completeness:

muller-login02% ./ieee-build.sh 
gnu
gnu with -ffpe-trap=invalid,zero,overflow
gnu with -O3 -ffpe-trap=invalid,zero,overflow
intel
intel with -fpe0
intel with -O3 -fpe0
nvidia
test-procs-mod.F90:
ieee-nvidia-inexact.F90:
nvidia with -O0 -g -Ktrap=fp -Mbounds -Kieee
test-procs-mod.F90:
ieee-nvidia-inexact.F90:
nvidia with -i4 -Mstack_arrays  -Mextend -byteswapio -Mflushz -Kieee -Mallocatable=03 -traceback  -O0 -g -Ktrap=fp -Mbounds -Kieee  -Mfree
test-procs-mod.F90:
ieee-nvidia-inexact.F90:
nvidia with -O2 -g -Ktrap=fp -Kieee
test-procs-mod.F90:
ieee-nvidia-inexact.F90:
test-procs-mod.F90:
ieee-nvidia-inexact.F90:
./ieee-build.sh: line 60: 2184875 Floating point exception./a.out > o.nvidia-opttrap.txt
nvidia with -O2 -g -Ktrap=fp -Mbounds -Kieee
test-procs-mod.F90:
ieee-nvidia-inexact.F90:
test-procs-mod.F90:
ieee-nvidia-inexact.F90:
2.493u 1.017s 0:04.02 87.0% 0pf+0w

With this test program:

program test

   use, intrinsic :: ieee_arithmetic  ! use this with ieee_usual
   !use, intrinsic :: ieee_exceptions ! use this for IEEE_DIVIDE_BY_ZERO
   use procs
   implicit none
   logical :: halt_inexact, inv, divz, ovf, unf, inx
   integer :: j
   integer, parameter :: nlevgrnd=6
   real(8) :: zsoi(nlevgrnd), scalez, zecoeff
   double precision :: a(33),b

   scalez=0.25
   zecoeff=0.25

   call ieee_get_flag(ieee_invalid,         inv)
   call ieee_get_flag(ieee_divide_by_zero,  divz)
   call ieee_get_flag(ieee_overflow,        ovf)
   call ieee_get_flag(ieee_underflow,       unf)
   call ieee_get_flag(ieee_inexact,         inx)
   write(*,*) 'Flags initial: inv=',inv,' divz=',divz,' ovf=',ovf,' unf=',unf,' inx=',inx

   call ieee_get_halting_mode(ieee_inexact, halt_inexact)
   print *, "halt_inexact AT THE TOP: ",halt_inexact

   call ieee_set_flag(ieee_all, .false.)
   do j = 1, nlevgrnd
      zsoi(j) = scalez*(exp(zecoeff*(dble(j)-0.5))-1.0)    !node depths
      call ieee_get_flag(ieee_invalid,         inv)
      call ieee_get_flag(ieee_divide_by_zero,  divz)
      call ieee_get_flag(ieee_overflow,        ovf)
      call ieee_get_flag(ieee_underflow,       unf)
      call ieee_get_flag(ieee_inexact,         inx)
      write(*,*) 'Flags after exp: inv=',inv,' divz=',divz,' ovf=',ovf,' unf=',unf,' inx=',inx
   enddo

   a = &
      [0.3d0,7d11,7d9,4d0,1d0,1d3,1d0,0.2d0, &
      9d11,1d10,5d0,1d3,1d3,1d0,1d3,1d5,2d9, &
      5d0,1d0,8d-1,9d1,4d2,5d-1,5d-1,1d-1,8d6, &
      1d0,0d0,0d0,0d0,0d0,0d0,0d0]
   print*,'check 1'
   call flush(6)
   call sub1(a,b)
   print*,'check 2'
   call flush(6)
end program test

and the module as posted in the nvidia forum thread.

@peterdschwartz
Contributor Author

I think all signs point to this being an issue in the compiler, which mistakenly turns on halting on inexact, rather than in the codebase, and I don't think it would be worth the effort to figure out which part of the code generation process is misbehaving.

There is really no use case for trapping ieee_inexact, so I'm happy with the solution I gave: it's clear, robust, and it ensures ieee_inexact is not trapped which is what we need.

If you want to keep exploring, go ahead, but there's no reason to hold up this PR while you do. There are potentially real ERS bugs with nvidia that need to be investigated.

3 participants