-
Notifications
You must be signed in to change notification settings - Fork 230
[WIP] Diagnostics: Fix Restart with Start/Stop Moving Window #6399
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: development
Are you sure you want to change the base?
[WIP] Diagnostics: Fix Restart with Start/Stop Moving Window #6399
Conversation
After restarts, all diagnostics lo/hi were not properly restored. This causes new checkpoints and diagnostics to become corrupted, as the wrong spatial data gets filtered. This fixes it.
|
@titoiride can you potentially test this PR with your data, too? :) Please restart from a checkpoint that was written from a simulation that ran from the beginning (not a checkpoint created by a restarted simulation). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a for this PR!
In addition to adapting the code to the start/end of the moving window, it seems that this PR is making more changes. Was this intentional?
For instance, the previous code used warpx.getmoving_window_x() to infer the current position of the moving window, while the new code relies instead on current_step to infer the current position.
In principle, warpx.getmoving_window_x() should be able to work with starting/stopping moving window. To avoid duplicating code (as you pointed out), should we remove the function warpx.getmoving_window_x() (since it is not used anymore) or should we try to fix it and introduce it again?
| const amrex::Real displacement = | ||
| warpx.getmoving_window_x() - warpx.Geom(0).ProbLo(moving_dir); | ||
| const int shift_num_base = static_cast<int> | ||
| (displacement / warpx.Geom(0).CellSize(moving_dir)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The issue here is subtle.
In FullDiagnostics::MovingWindowAndGalileanDomainShift this shift is only done for steps where if (WarpX::moving_window_active(step+1)) an accumulates the m_lo/m_hi on truncated integer (cell) locations.
The implementation here omitted these details and thus introduces a drift to the real dimensions on restart.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just for context, some of this had been changed recently in #5985. More information on what bug this was fixing in the PR description.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
attn @bnara : can you double-check this PR does not introduce any issues with your existing moving-window workflows? This should improve the logic when moving windows start later than step 0 or stop at a certain step.
I was a bit puzzled as well why I could not find a solution with I think this can be fixed by fully removing the double book-keeping in In other words, While the over all moving window does the same: |
After restarts, all diagnostics lo/hi bounds were not properly restored. This was specifically the case if the moving window had a non-zero start step and/or a stop step before the checkpoint step. After a restart, this causes new checkpoints and diagnostics to become corrupted, as the wrong spatial data gets filtered.
This fixes it. Existing checkpoints (from simulations that started from 0) are still readable with this fix.
Fix #6392
To Do
Cleanup Notice
This bug shows how risky it is to duplicate moving-window logic at multiple places. The moving/shift logic should go into functions and they should be re-used. This PR makes this anti-pattern even worse by doubling down.
A follow-up or additional commit should deduplicate the logic to make it safer: #6400