
Update _get_start_data to always grab the beginning of timestep time#3414

Merged
paulromano merged 6 commits into openmc-dev:develop from lewisgross1296:fix_continue_h5_bug
Jun 4, 2025

Conversation

@lewisgross1296
Contributor

@lewisgross1296 lewisgross1296 commented May 19, 2025

Description

Right now, any interrupted depletion simulation (think max wall time or an unintended power-off) will have an incorrect time for the point at which the simulation restarts. This happens whether continue_timesteps is True or False (i.e., independent of #3272) and needs to be fixed. This function is the problem:

def _get_start_data(self):
    if self.operator.prev_res is None:
        return 0.0, 0
    return (self.operator.prev_res[-1].time[-1],
            len(self.operator.prev_res) - 1)

The fix is actually a one-liner: .time[-1] -> .time[-2]. I will explain.
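To illustrate, the fixed logic can be sketched as below. Note that FakeStepResult and the standalone get_start_data are illustrative stand-ins for the OpenMC objects, not the actual repo code:

```python
class FakeStepResult:
    """Stand-in for openmc.deplete.StepResult; only the ``time`` member matters here."""
    def __init__(self, time):
        self.time = time  # [t_start, t_end] in seconds

def get_start_data(prev_res):
    """Return (restart time, completed step count) from previous results."""
    if prev_res is None:
        return 0.0, 0
    # time[-2] (equivalently time[0] for a length-2 list) is the *start*
    # of the last recorded step, which is correct whether or not the
    # previous run finished that step.
    return prev_res[-1].time[-2], len(prev_res) - 1

# A killed run whose final recorded step is [86400, 172800]:
killed = [FakeStepResult([0.0, 86400.0]), FakeStepResult([86400.0, 172800.0])]
print(get_start_data(killed))  # (86400.0, 1) -- restart at 86400, not 172800
```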

Every StepResult in a depletion simulation has a member time, which is a list of float:

class StepResult:
    """Result of a single depletion timestep

    .. versionchanged:: 0.13.1
        Name changed from ``Results`` to ``StepResult``

    Attributes
    ----------
    k : list of (float, float)
        Eigenvalue and uncertainty for each substep.
    time : list of float
        Time at beginning, end of step, in seconds.

For every step, the time member has values [t, t+dt] for the given step. When reading in previous results, the last depletion step appears to have dt=0 and thus is stored as [t_final, t_final]. I discovered this by printing out time_dset[step, :] in the from_hdf5 method each time it's called in the Results constructor:

def __init__(self, filename='depletion_results.h5'):
    data = []
    if filename is not None:
        with h5py.File(str(filename), "r") as fh:
            cv.check_filetype_version(fh, 'depletion results', VERSION_RESULTS[0])
            # Get number of results stored
            n = fh["number"][...].shape[0]
            for i in range(n):
                data.append(StepResult.from_hdf5(fh, i))
    super().__init__(data)

This is a fine thing to do if the simulation never gets killed, but it introduces a problem when a depletion simulation is not guaranteed to complete its last step with [t, t]. If a simulation gets killed, the final time member will look something like this:

[86400, 172800]

Since it never finishes this step, in the next run (continue or not), self.operator.prev_res[-1].time[-1] is actually 172800. However, a simulation that restarts from this previous run should restart at t=86400 and finish the step. Currently, the restart begins at t=172800 and applies the first dt of the new simulation from there. When this gets written, the depletion_results.h5 file appears to contain a wrong dt for the interrupted step: the sum of the last step's dt and the first new dt.
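The restart arithmetic described above can be spelled out with the numbers from this example (times in seconds; this is an illustration, not repo code):

```python
# A killed run's last StepResult records [t, t + dt] for the unfinished step.
killed_last_step = [86400.0, 172800.0]

# Buggy behavior: restart from time[-1], the *end* of the unfinished step.
wrong_start = killed_last_step[-1]   # 172800.0
# Correct behavior: restart from the *start* and redo the unfinished step.
right_start = killed_last_step[0]    # 86400.0

# With a first new dt of one day, the buggy restart produces a step that
# appears to span the old dt plus the new dt:
new_dt = 86400.0
apparent_dt = (wrong_start + new_dt) - right_start
print(apparent_dt)  # 172800.0 -- twice the intended step length
```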

Since the first value in the time member will always be the time you want in a restart (whether the previous run completed or was killed), we should grab this value instead. Very subtle.
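A quick check that the first value is correct in both cases (the numbers here are illustrative):

```python
# Completed run: the final StepResult is stored as [t_final, t_final].
completed_last = [1123200.0, 1123200.0]
# Killed run: the final StepResult is [t, t + dt] for the unfinished step.
killed_last = [86400.0, 172800.0]

# In both cases, index 0 gives the right restart time:
print(completed_last[0])  # 1123200.0 -- resume after the final step
print(killed_last[0])     # 86400.0   -- redo the interrupted step
```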

Fixes #3387.

Checklist

  • I have performed a self-review of my own code
  • I have run clang-format (version 15) on any C++ source files (if applicable)
  • I have followed the style guidelines for Python source files (if applicable)
  • I have made corresponding changes to the documentation (if applicable)
  • I have added tests that prove my fix is effective or that my feature works (if applicable)

@lewisgross1296
Contributor Author

lewisgross1296 commented May 19, 2025

I've held off pushing the fix for now so we can see the new test that exposes the issue fail. I will push the fix after CI gets started.

Also realizing

        return (self.operator.prev_res[-1].time[0],
                len(self.operator.prev_res) - 1)

might be more clear than

        return (self.operator.prev_res[-1].time[-2],
                len(self.operator.prev_res) - 1)

It seems this member variable will always have length 2, so perhaps a 0 is more clear.
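A quick sanity check of that equivalence (assuming, as noted, that time always has length 2):

```python
# For a two-element list, index 0 and index -2 refer to the same element.
time = [86400.0, 172800.0]
print(time[0] == time[-2])  # True
```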

@lewisgross1296 lewisgross1296 changed the title add assertions to tests and add new test to ensure a killed continue … Update _get_start_data to always grab the beginning of timestep time May 19, 2025
@lewisgross1296
Contributor Author

lewisgross1296 commented May 20, 2025

Looks like I made a silly mistake regarding ==; I just fixed that by replacing it with numpy.array_equal. Though, I'm realizing the third failure was for a reason I didn't expect:

E   OSError: Error reading file 'continue_model.xml': failed to load "continue_model.xml": No such file or directory

Perhaps I'm misunderstanding what files GitHub Actions has access to, but I thought that pushing XML/h5 files to /tests/unit_tests would mean the test could access them.

Either way @paulromano, I'm curious about your thoughts on this test in general, since it feels a little non-standard. I think it's important to test for this case though. Still going to hold off on pushing the fix until this test fails in the way we expect (the same as in #3387)
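For reference, the == mistake mentioned above is a common NumPy pitfall; a minimal sketch (not the repo's test code):

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])

# `==` on an array is elementwise and returns another array, so using it
# directly in an `assert` raises "truth value of an array is ambiguous".
elementwise = a == [1.0, 2.0, 3.0]
print(type(elementwise))  # <class 'numpy.ndarray'>

# np.array_equal collapses the comparison to a single boolean:
print(np.array_equal(a, [1.0, 2.0, 3.0]))  # True
```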

@gonuke
Contributor

gonuke commented May 20, 2025

> Though, I'm realizing the third failure was for a reason I didn't expect
>
> E   OSError: Error reading file 'continue_model.xml': failed to load "continue_model.xml": No such file or directory
>
> Perhaps I'm misunderstanding what files GitHub Actions has access to, but I thought that pushing XML/h5 files to /tests/unit_tests would mean the test could access them.

It looks like the standard/best practice is to use Path(__file__).parents[] to build the path relative to the test and find the file you want.
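A sketch of that pattern (continue_model.xml is just the fixture name from the error above; the variable names are illustrative):

```python
from pathlib import Path

# Resolve fixture files relative to the test module itself, so the test
# works no matter which directory pytest is launched from.
here = Path(__file__).parent                # directory containing this file
model_xml = here / "continue_model.xml"     # hypothetical fixture path

# parents[n] climbs further up the tree when needed, e.g.:
# repo_root = Path(__file__).parents[1]     # two levels up
print(model_xml.name)  # continue_model.xml
```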

@lewisgross1296
Contributor Author

lewisgross1296 commented May 20, 2025

Tested locally and now this commit should show only a failure for test_deplete_continue.py::test_killed_and_continue which exposes the current issue with the _get_start_data(self) method in abc.py. It should show that the last simulation time before the job was killed has the wrong timestep (a dt=5 where it should be dt=2) when output from the final results.

Once that is the only failure, I will push the correction and show that the commit then passes tests and can be merged.

@lewisgross1296
Contributor Author

Was hoping there would be more printout from GitHub Actions (and this time it stopped after failing this test), but it looks like this is the only failing test in test_deplete_continue.py (which has had updates):

tests/unit_tests/test_deplete_continue.py::test_continue PASSED          [ 46%]
tests/unit_tests/test_deplete_continue.py::test_continue_continue PASSED [ 46%]
tests/unit_tests/test_deplete_continue.py::test_killed_and_continue 
Error: Process completed with exit code 255.

Locally, I get this as printout

>       assert np.array_equal(np.diff(final_res.get_times(time_units="d")),[1.0, 2.0, 3.0, 4.0])
E       AssertionError: assert False
E        +  where False = <function array_equal at 0x7c744c974870>(array([1., 5., 3., 4.]), [1.0, 2.0, 3.0, 4.0])
E        +    where <function array_equal at 0x7c744c974870> = np.array_equal
E        +    and   array([1., 5., 3., 4.]) = <function diff at 0x7c744b3f4bb0>(array([ 0.,  1.,  6.,  9., 13.]))
E        +      where <function diff at 0x7c744b3f4bb0> = np.diff
E        +      and   array([ 0.,  1.,  6.,  9., 13.]) = get_times(time_units='d')
E        +        where get_times = [<StepResult: t=0.0, dt=86400.0, source=35000.0>, <StepResult: t=86400.0, dt=172800.0, source=35000.0>, <StepResult: t...rce=35000.0>, <StepResult: t=777600.0, dt=345600.0, source=35000.0>, <StepResult: t=1123200.0, dt=0.0, source=35000.0>].get_times

/home/lgross/openmc/tests/unit_tests/test_deplete_continue.py:93: AssertionError

I will now push the fix

@lewisgross1296
Contributor Author

lewisgross1296 commented May 20, 2025

Hmm, so I was not expecting this failure... I was testing with Python 3.11.11, so I switched to 3.12.8 (pyenv randomly doesn't have 3.12.10).

Locally, I'm getting this

lgross@ulam:~/openmc/tests/unit_tests (fix_continue_h5_bug) $ pytest test_deplete_continue.py 
===================================== test session starts =====================================
platform linux -- Python 3.12.8, pytest-8.3.5, pluggy-1.6.0
rootdir: /home/lgross/openmc
configfile: pytest.ini
collected 5 items

test_deplete_continue.py .....                                          [100%]

====================================== warnings summary ======================================
tests/unit_tests/test_deplete_continue.py: 768 warnings
  /home/lgross/.pyenv/versions/3.12.8/lib/python3.12/multiprocessing/popen_fork.py:66: 
  DeprecationWarning: This process (pid=3925931) is multi-threaded, use of fork() may 
  lead to deadlocks in the child. self.pid = os.fork()

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
=============================== 5 passed, 768 warnings in 55.76s ===============================

I'm only getting this message from GitHub Actions,

Error: Process completed with exit code 255.

so I'll try updating the commit message to spawn an action-tmate session

EDIT: it seems like prepending [gha-debug] didn't allow the tmate session to spawn 🤔
[screenshot: tmate_failure]

@lewisgross1296 lewisgross1296 force-pushed the fix_continue_h5_bug branch 3 times, most recently from a244474 to 624559d Compare May 21, 2025 16:56
…estep time so it will work whether a previous simulation completes or is interrupted
@lewisgross1296
Contributor Author

I realized that there already exists a (much) simpler chain in the repo, so I switched to using that. I also realized that the tests use the NNDC cross sections while the XML/h5 files I generated used ENDF/B-VIII.0. There are some nuclide differences (e.g. the handling of carbon) which might be causing the error. To eliminate possible sources of error, I regenerated the continue_depletion_results.h5 file and XML to be more consistent with the testing framework.

Contributor

@gonuke gonuke left a comment


A few edits of the comments you added

Comment thread openmc/deplete/abc.py Outdated (5 threads)
Co-authored-by: Paul Wilson <paul.wilson@wisc.edu>
Comment thread openmc/deplete/abc.py
Contributor

@gonuke gonuke left a comment


This all looks good to me - thanks @lewisgross1296

@gonuke
Contributor

gonuke commented May 22, 2025

In case @paulromano may be wondering, this is basically a 1-line PR (2 characters, in fact) other than the tests...

Contributor

@paulromano paulromano left a comment


@lewisgross1296 Thanks for the fix here (and thanks for the nudge @gonuke). The one-line change looks good to me. The test, however, is a little more problematic because if and when we have other changes that require updating the continue_depletion_results.h5 file, there is no easy way to do that. Other tests recognize when you call pytest --update and will update their reference result files accordingly, but I don't see any way of doing that for this test since it requires killing a process. Are you OK with just removing the test for the time being? Obviously not ideal but honestly I would prefer to have no test rather than have a test that is difficult to update in the future.

@gonuke
Contributor

gonuke commented Jun 4, 2025

Probably fine to get rid of the test from our perspective. We briefly discussed whether we could generate a test that would result in an interrupted job, thus creating this file on the fly, but I'm not sure if that's possible.

@paulromano
Contributor

Ok, sounds good. I'll go ahead and remove the test and merge this. If you guys are able to figure out a good way to test it, feel free to follow up with another PR but no sweat otherwise.

@paulromano paulromano enabled auto-merge (squash) June 4, 2025 21:34
@lewisgross1296
Contributor Author

Yeah, the test seems difficult to maintain since I had to locally kill the simulation to create that h5 file. It probably makes the most sense to remove it, but knowing the test passed here is hopefully sufficient.

If something in the future causes the continue runs to break, we can cross that bridge when we get there. Hopefully this change is resistant to being broken. The updated doc string should help anyone who might need to use or change the _get_start_data() function for future development.

Thanks for finishing this up @paulromano!

@paulromano paulromano merged commit e14bb88 into openmc-dev:develop Jun 4, 2025
14 checks passed
@lewisgross1296 lewisgross1296 deleted the fix_continue_h5_bug branch February 5, 2026 01:45


Development

Successfully merging this pull request may close these issues.

restarting killed depletion simulations improperly saving h5 data

3 participants