Bug Description
I've been attempting to fix a subtle and ominous failure of OpenMC since early April (see forum post). Essentially, my model encountered an extremely rare event in which a particle hit the maximum number of events and failed to bank a fission site (causing MPI to abort). After months of debugging, I think we finally have an idea about what exactly was going wrong and a way to fix it. The fix is straightforward, but I need to motivate the change before proposing it.
First, I ran openmc -g to see if there were overlaps in my geometry, which did occur very rarely. As discussed on the forum, it appears the particle was detecting overlaps but actually was not overlapped -- it "tunneled," as in it's true location did not match up with the cell it thought it was in. At this point, I changed OpenMC to mark this particle as lost as soon as an overlap was detected to get a particle restart file.
With the restart and particle number, I used a trace to follow along its history. With some extra prints in particle.cpp, after a certain point, the distance to surface become negative. This causes the particle to cross back and forth over the same surface with the distance becoming more and more negative with each crossing until it became negative infinity. This will always win the test between surface/collision since it's negative, so it gets stuck crossing over and over again until the maximum events is hit leading to my observed failure. Below shows the first time distance goes from positive to negative
within event_advance()
boundary().surface()=8
boundary().distance() = 1.41871, collision_distance() = 0.203848, distance_cutoff = 1.79769e+308
distance winner = 0.203848
current cell in lowest coord level = 565
current position = (11.629, 227.611, 70.7238)
current direction = (0.831235, -0.53098, 0.267503)
(n,elastic) with Be9. Energy = 0.2622440068665347 eV.
within event_advance()
boundary().surface()=8
boundary().distance() = -6.33239, collision_distance() = 3.979, distance_cutoff = 1.79769e+308
distance winner = -6.33239
current cell in lowest coord level = 565
current position = (11.7985, 227.503, 70.7783)
current direction = (0.0121339, -1.00645, 0.182194)
Crossing surface 231
Entering cell 671
within event_advance()
boundary().surface()=-8
boundary().distance() = -44.2446, collision_distance() = 0.520681, distance_cutoff = 1.79769e+308
distance winner = -44.2446
current cell in lowest coord level = 566
current position = (11.7216, 233.876, 69.6246)
current direction = (0.0121339, -1.00645, 0.182194)
Crossing surface 231
Entering cell 670
...
within event_advance()
boundary().surface()=-8
boundary().distance() = -inf, collision_distance() = 2.22087, distance_cutoff = 1.79769e+308
distance winner =
Crossing surface 231
Entering cell 670
While this showed how the particle failed, it did not explain why. With more prints about particle properties, it became clear that somehow the particle direction norm became unnormalized. The norm also grew over time until the negative distance was observed. I believe that axis_aligned_cylinder_distance in surface.cpp expects a normalized direction vector and the unnormalized one produced by rotate_angle is what caused this behavior. Some more prints show that rotate_angle is being passed valid parameters but due to floating point error, it produces an unnormalized direction vector.
within event_advance()
boundary().distance() = 0.818526, collision_distance() = 0.2025, distance_cutoff = 1.79769e+308
distance winner = 0.2025
within rotate_angle()
rotate_angle function inputs
input u = (0.0833668, -0.421571, 0.902956)
input mu = -0.679666
input phi =0.311143
computed dir = (0.285961, -0.288468, -0.913791)
norm of said dir = 1
(n,elastic) with O16. Energy = 0.19535414815002738 eV.
within event_advance()
boundary().distance() = 0.897738, collision_distance() = 1.39221, distance_cutoff = 1.79769e+308
distance winner = 0.897738
Crossing lattice 342. Current position (11,4,0). r=(51.120248374578296,
221.72407437267995, 34.51621068287729)
within event_advance()
boundary().distance() = 1.63008, collision_distance() = 2.12254, distance_cutoff = 1.79769e+308
distance winner = 1.63008
Crossing lattice 342. Current position (12,4,0). r=(51.58638820904518,
221.25384842430924, 33.02665679743267)
within event_advance()
boundary().distance() = 0.503595, collision_distance() = 0.529611, distance_cutoff = 1.79769e+308
distance winner = 0.503595
Crossing surface 231
Entering cell 664
within event_advance()
boundary().distance() = 0.942796, collision_distance() = 0.656483, distance_cutoff = 1.79769e+308
distance winner = 0.656483
within rotate_angle()
rotate_angle function inputs
input u = (0.285961, -0.288468, -0.913791)
input mu = 0.25835
input phi =4.48897
computed dir = (-0.457453, -0.876646, -0.149138)
norm of said dir = 1.00001
(n,elastic) with Be9. Energy = 0.22144002012246095 eV.
...
within event_advance()
boundary().distance() = 1.41871, collision_distance() = 0.203848, distance_cutoff = 1.79769e+308
distance winner = 0.203848
within rotate_angle()
rotate_angle function inputs
input u = (0.831235, -0.53098, 0.267503)
input mu = 0.568352
input phi =4.67434
computed dir = (0.0121339, -1.00645, 0.182194)
norm of said dir = 1.02288
(n,elastic) with Be9. Energy = 0.2622440068665347 eV.
within event_advance()
boundary().distance() = -6.33239, collision_distance() = 3.979, distance_cutoff = 1.79769e+308
distance winner = -6.33239
Crossing surface 231
Entering cell 671
...
It seems that due to small floating point errors, rotate_angle produces a direction vector whose norm is greater than 1. I propose that we normalize it in this function to prevent that from happening. It shouldn't really affect the physics and in most cases, the norm stays 1 anyways. I tested with this change and the particle does not get stuck bounding back and forth over the surface and is actually able to finish the history.
Tagging @gonuke @pshriwise @paulromano @GuySten for update. I will issue a PR assuming that people are okay with this kind of change.
Steps to Reproduce
My XML are provided on the linked forum post. If desired, I can push a branch with all my printouts in OpenMC, but I tried my best to share outputs where necessary. I can share my restart file too.
Environment
OpenMC base upon which I added printouts
Version | 0.15.3-dev311
Commit Hash | fd1bc26af0375f00a51cc5ca094471bb06f66006
OS
PRETTY_NAME="Ubuntu 24.04.1 LTS"
NAME="Ubuntu"
VERSION_ID="24.04"
VERSION="24.04.1 LTS (Noble Numbat)"
VERSION_CODENAME=noble
ID=ubuntu
ID_LIKE=debian
Bug Description
I've been attempting to fix a subtle and ominous failure of OpenMC since early April (see forum post). Essentially, my model encountered an extremely rare event in which a particle hit the maximum number of events and failed to bank a fission site (causing MPI to abort). After months of debugging, I think we finally have an idea about what exactly was going wrong and a way to fix it. The fix is straightforward, but I need to motivate the change before proposing it.
First, I ran
openmc -gto see if there were overlaps in my geometry, which did occur very rarely. As discussed on the forum, it appears the particle was detecting overlaps but actually was not overlapped -- it "tunneled," as in it's true location did not match up with the cell it thought it was in. At this point, I changed OpenMC to mark this particle as lost as soon as an overlap was detected to get a particle restart file.With the restart and particle number, I used a trace to follow along its history. With some extra prints in
particle.cpp, after a certain point, the distance to surface become negative. This causes the particle to cross back and forth over the same surface with the distance becoming more and more negative with each crossing until it became negative infinity. This will always win the test between surface/collision since it's negative, so it gets stuck crossing over and over again until the maximum events is hit leading to my observed failure. Below shows the first time distance goes from positive to negativeWhile this showed how the particle failed, it did not explain why. With more prints about particle properties, it became clear that somehow the particle direction norm became unnormalized. The norm also grew over time until the negative distance was observed. I believe that
axis_aligned_cylinder_distanceinsurface.cppexpects a normalized direction vector and the unnormalized one produced byrotate_angleis what caused this behavior. Some more prints show thatrotate_angleis being passed valid parameters but due to floating point error, it produces an unnormalized direction vector.It seems that due to small floating point errors, rotate_angle produces a direction vector whose norm is greater than 1. I propose that we normalize it in this function to prevent that from happening. It shouldn't really affect the physics and in most cases, the norm stays 1 anyways. I tested with this change and the particle does not get stuck bounding back and forth over the surface and is actually able to finish the history.
Tagging @gonuke @pshriwise @paulromano @GuySten for update. I will issue a PR assuming that people are okay with this kind of change.
Steps to Reproduce
My XML are provided on the linked forum post. If desired, I can push a branch with all my printouts in OpenMC, but I tried my best to share outputs where necessary. I can share my restart file too.
Environment
OpenMC base upon which I added printouts
OS