Use ieee intrinsics to avoid trapping ieee_inexact signals #7867
|
going to perform sanity checks on chrysalis |
|
Noting a reproducer: Since this looks to be a problem only with the nvidia compiler, it might be good to put this test under ? |
|
Adding ifdefs is unnecessary: trapping |
|
We are unable to reproduce this behavior with a simple example. I tried simply: in both places where we see the error and the test passes. So I'm not sure what's best to do here |
if (.not. readvar ) then
   ! Variable ZSOI not found, use the ELM parameters.
   if (ieee_support_halting(ieee_inexact)) then
      call ieee_set_flag(ieee_all,.false.)
This line turns off ALL floating point trapping, which would override our DEBUG settings. This would have to be wrapped in an #ifdef CPRNVIDIA if you really want this.
   ! Variable ZSOI not found, use the ELM parameters.
   if (ieee_support_halting(ieee_inexact)) then
      call ieee_set_flag(ieee_all,.false.)
      call ieee_set_halting_mode(ieee_inexact, .false.)
This will turn off ieee_inexact trapping for everything that runs after this, making it a global setting, not just for the land (unless the land is running on its own tasks). If we want to set these things globally, it should be done in the driver. Or you should set it back to "on".
ieee_inexact trapping is already off for everything. None of our code could work if that were not the case.
This is essentially a no-op except for nvidia on Perlmutter, which I can only hazard is a bug in the compiler or runtime.
Oh because ieee_support_halting(ieee_inexact) will be False everywhere except NVIDIA?
It's even false with nvidia! From the nvidia docs, our current flag only captures inv, divz, ovf. inexact is really never used because all transcendental functions are inexact. I have verified this by calling ieee_get_halting_mode(ieee_inexact, halt), which is .false. wherever I call it.
The special flag -Ktrap=none is used to preserve FPEs during compilation without unmasking any of them at runtime.
The inv, divz, and ovf flags are often the most interesting, as they signify abnormal floating-point behavior in the program. These can be enabled with the useful shorthand -Ktrap=fp.
it seems to be a bug in the code gen or AVX2 exp runtime. Found a similar issue of random inexact being tripped in 25.9 https://forums.developer.nvidia.com/t/nvfortran-25-9-spurious-floating-point-exception/346604
sorry not the support halting but the actual halting flag is false
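The "set it back" suggestion in this thread could be sketched roughly as follows: save the current ieee_inexact halting mode, disable it around the sensitive computation, then restore whatever was there before. This is a minimal standalone illustration, not the code in this PR; the program name and the variables saved_halt, x and z are hypothetical, and r8 is defined locally here rather than taken from the model's kind module.

```fortran
program restore_inexact_halting
  ! Hypothetical standalone sketch: save, disable, and restore
  ! the halting mode for ieee_inexact around one computation.
  use, intrinsic :: ieee_exceptions
  implicit none
  integer, parameter :: r8 = selected_real_kind(12)  ! local stand-in for the model's r8
  logical  :: saved_halt
  real(r8) :: x, z

  if (ieee_support_halting(ieee_inexact)) then
     ! Remember the caller's setting before changing anything
     call ieee_get_halting_mode(ieee_inexact, saved_halt)
     call ieee_set_halting_mode(ieee_inexact, .false.)
  end if

  ! Computation that may raise the inexact flag
  x = 0.5_r8
  z = exp(x) - 1._r8

  if (ieee_support_halting(ieee_inexact)) then
     ! Put the halting mode back so the change stays local
     call ieee_set_halting_mode(ieee_inexact, saved_halt)
  end if

  print *, 'z = ', z
end program restore_inexact_halting
```

With this pattern the change is scoped to the guarded region, so whatever the driver or the DEBUG build configured stays in effect afterwards.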
|
@rljacob For completeness, here are the debugging statements I tested to confirm what's happening (to the best of my knowledge at least):
! TOP OF SUBROUTINE
block
  logical :: halt_inexact
  call ieee_get_halting_mode(ieee_inexact, halt_inexact)
  print *, "halt_inexact AT THE TOP: ",halt_inexact
end block
......
block
  logical :: inv, divz, ovf, unf, inx
  call ieee_set_flag(ieee_all, .false.)
  do j = 1, nlevgrnd
    zsoi(j) = scalez*(exp(zecoeff*(dble(j)-0.5_r8))-1._r8) !node depths
    call ieee_get_flag(ieee_invalid, inv)
    call ieee_get_flag(ieee_divide_by_zero, divz)
    call ieee_get_flag(ieee_overflow, ovf)
    call ieee_get_flag(ieee_underflow, unf)
    call ieee_get_flag(ieee_inexact, inx)
    write(*,*) 'Flags after exp: inv=',inv,' divz=',divz,' ovf=',ovf,' unf=',unf,' inx=',inx
  enddo
end block
output: |
|
What are we trying to show regarding the inexact flag? |
|
@ndkeen this shows that the only ieee flag being set from the You did not show the output from this line:
call ieee_get_halting_mode(ieee_inexact, halt_inexact)
print *, "halt_inexact AT THE TOP: ",halt_inexact
But if you did, that would tell you if the program is set to halt (i.e. trap) the |
You are correct, none have it. But while it's odd that inexact is showing T, it's not clear why nvidia would behave differently than others? |
|
It's not odd to show inexact = T after an exp. Based on this thread https://forums.developer.nvidia.com/t/nvfortran-25-9-spurious-floating-point-exception/346604 Someone was able to make a small reproducer with 25.9 and with -O1 flag. It may be worth simply copying that code and seeing if we get the same results. But Matt Colgrove confirms that it shouldn't be happening and is a compiler bug. |
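The distinction being discussed, a raised status flag versus a halting (trapping) mode, can be checked in isolation with something like the sketch below. It is an assumption-laden illustration rather than the reproducer from the forum thread: the program and variable names are made up, and x is marked volatile only to discourage the compiler from folding the exp call at compile time. Note that ieee_set_flag only clears or sets status flags; it does not change the halting mode.

```fortran
program inexact_flag_vs_halting
  ! Hypothetical sketch: report the inexact status flag and the
  ! inexact halting mode separately after a single exp() call.
  use, intrinsic :: ieee_exceptions
  implicit none
  integer, parameter :: r8 = selected_real_kind(12)  ! local stand-in for the model's r8
  real(r8), volatile :: x
  real(r8) :: y
  logical  :: flag_inexact, halt_inexact

  ! Halting mode: would the program trap when inexact is signaled?
  halt_inexact = .false.
  if (ieee_support_halting(ieee_inexact)) then
     call ieee_get_halting_mode(ieee_inexact, halt_inexact)
  end if

  ! Status flag: clear it, do the work, then see whether it was raised
  call ieee_set_flag(ieee_inexact, .false.)
  x = 0.5_r8
  y = exp(x)
  call ieee_get_flag(ieee_inexact, flag_inexact)

  print *, 'y            = ', y
  print *, 'inexact flag = ', flag_inexact
  print *, 'inexact halt = ', halt_inexact
end program inexact_flag_vs_halting
```

A raised flag together with a halting mode of .false. would be the ordinary, harmless case; a signal delivered while the halting mode reports .false. would be consistent with the compiler or runtime issue suspected above.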
|
OK, using the test code provided, I was able to get a signal. If I build with debug flags and This might be pointing to a place where we are reading in a file and perhaps could be a clue. After adding a write statement (and flush), I can see the file it is trying to read, which does not seem to have an issue. But it sounds to me like the error must be happening in the Just for completeness: |
Potentially due to a compiler bug in nvhpc, ELM is halting on ieee_inexact values when Ktrap=fp mode is set.