Implement stage-level coupling for split integration #1049

Open
efaulhaber wants to merge 36 commits into trixi-framework:main from efaulhaber:stage-coupling

@efaulhaber
Member

@efaulhaber efaulhaber commented Jan 16, 2026

This PR

  • implements stage-level coupling as opposed to the previously used step-level coupling.
  • fixes a memory "leak". The sub-integrator was storing the split solution for every single sub-integration call (each fluid time step), which caused massive VRAM allocations for long-running GPU simulations.

In step-level coupling, the fluid is advanced a full step before the structure is advanced a full step. This reduces the stability of the main time integration: in my tests with Carpenter-Kennedy, the maximum stable time step was halved.
Stage-level coupling calls the sub-integration in every fluid RK stage. The advantage is that the stability properties of the time integrator are preserved. In the current version, this does not have a significant performance impact and is therefore the default (edit: it does not work for some RK schemes with non-monotonic stage times and is therefore not the default). Only for small ratios (like 2-5x smaller time step for the structure), the step-level coupling might be more efficient. For details on how this is implemented, check out my comment below. The implemented version is the one without "restart", and "predict" is enabled by default but can be disabled via kwarg.
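
The edit note above refers to RK schemes with non-monotonic stage times. The following is a minimal Python sketch (not the Julia code of this PR) of how such a property could be checked. The function names and the abscissae are hypothetical and chosen only for illustration. The underlying assumption is that a sub-integrator which continues forward from one stage time to the next needs stage times that never decrease.

```python
def stage_times(t_n, dt, c):
    """Absolute stage times t_n + c_i * dt for the abscissae c of an explicit RK scheme."""
    return [t_n + c_i * dt for c_i in c]


def has_monotonic_stage_times(c):
    """Return True if the abscissae never decrease from one stage to the next."""
    return all(c_prev <= c_next for c_prev, c_next in zip(c, c[1:]))


# Hypothetical abscissae, not taken from any specific scheme.
c_monotonic = [0.0, 0.25, 0.5, 1.0]
c_non_monotonic = [0.0, 0.5, 0.25, 1.0]

print(has_monotonic_stage_times(c_monotonic))
print(has_monotonic_stage_times(c_non_monotonic))
```

A check of this kind could be used to fall back to step-level coupling for schemes where the second call returns `False`; whether the PR does anything like this is not stated here.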

| Δt | Step-level coupling | Stage-level coupling |
|---|---|---|
| 9e-4 | (image) | (image) |
| 1e-3 | (image) | (image) |
| 1.8e-3 | | (image) |

For 1.9e-3, I get "instability detected" with stage-level coupling. The maximum stable time step of 1.8e-3 with stage-level coupling is the same as when making the ball a moving solid wall boundary (non-elastic).
Note that we cannot do this with a CFL because the StepsizeCallback has an upper limit of 1.2e-3 due to #1048.

Note that something is still not right, as the larger time step makes the ball fall deeper with stage-level coupling. (Edit: This is fixed with "predict" below.)

@efaulhaber efaulhaber self-assigned this Jan 16, 2026
@efaulhaber efaulhaber added the enhancement label Jan 16, 2026
@efaulhaber
Member Author

I have now implemented a slightly different benchmark simulation. A TLSPH square with E=1e8 and rho=1200 is fully submerged in a fluid with rho=1000. It starts with zero velocity and then slowly sinks. I measure stability as the largest stable time step and accuracy as the deviation from a simulation with a very small time step.
(image)

The version I showed above was stage-coupling with "restart" (the first row in the table below), which means I reset the sub-integrator in every stage to the state of the previous time step. Without "restart", the sub-integrator is integrated to the first stage time with the fluid state prediction of the first stage, and then continues from there in the second stage, etc.
"Predict" means I apply an explicit Euler step with the structure velocity to predict the structure position at the stage time as u += v * (t_new - t_previous). Then I use this prediction to compute $F_\text{fluid}$.
"Deviation" is the (normalized) difference in the y-coordinates of the square at t=1.5 between the split integration and the non-split integration (fluid and structure integrated together at Δt=1.44e-4), which I consider the reference for this simulation (time integration error almost zero). A negative deviation means the square sank too fast, a positive deviation means it didn't sink fast enough.
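
The "predict" step described above can be sketched in a few lines of Python. This is only an illustration of the explicit Euler formula `u += v * (t_new - t_previous)`, not the PR's Julia implementation; the function name `predict_positions` and the use of plain lists are assumptions made for this example.

```python
def predict_positions(u, v, t_new, t_previous):
    """Explicit Euler prediction of positions at t_new from the state at t_previous.

    u: positions at t_previous, v: velocities at t_previous (same length).
    Returns a new list instead of updating u in place.
    """
    dt = t_new - t_previous
    return [u_i + v_i * dt for u_i, v_i in zip(u, v)]


# Two coordinates of a hypothetical particle, predicted a quarter time unit ahead.
u_predicted = predict_positions([0.0, 1.0], [2.0, -1.0], t_new=0.5, t_previous=0.25)
print(u_predicted)
```

The predicted positions would then be used to evaluate $F_\text{fluid}$ at the stage time instead of the positions from the previous step or stage.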

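| Stage-coupling | Restart | Predict | max Δt | Deviation @ Δt=8e-4 | Deviation @ Δt=1.6e-3 | #Sub-steps @ Δt=8e-4 | #Sub-steps @ Δt=1.6e-3 |
|---|---|---|---|---|---|---|---|
| yes | yes | no | 1.6e-3 | -8.47e-2 | -1.91e-1 | 38k | 35.2k |
| yes | yes | yes | 1.6e-3 | -2.50e-4 | 1.47e-3 | 38k | 35.2k |
| yes | no | no | 1.6e-3 | -2.01e-2 | -3.86e-2 | 15k | 12.3k |
| yes | no | yes | 1.6e-3 | -2.91e-4 | -3.95e-4 | 15k | 12.3k |
| no | | no | 8.0e-4 | -4.39e-2 | | 11.2k | |
| no | | yes | 8.0e-4 | 3.70e-2 | | 11.2k | |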

Interpreting these results, we can see that the methods without position prediction all have a significant negative deviation (even more pronounced at the larger time step, where only stage-level coupling is stable), indicating an underestimation of $F_\text{fluid}$. This is reasonable because $F_\text{fluid}$ is computed at the previous time step (or at the previous stage for the non-restart methods), at which the square is higher up, so the forces from the fluid are underestimated. This effect is smaller for non-restart methods because they compute $F_\text{fluid}$ at the previous stage instead of going further back to the previous step. Prediction significantly increases the accuracy.

The "restart" methods compute the final structure state by integrating from $t_n$ to $t_{n+1}$ with a constant $F_\text{fluid}$ computed at $t_n$. The non-restart methods instead compute $F_\text{fluid}$ based on the predicted fluid state for each stage (which can be considered less accurate since it's a prediction), but re-compute $F_\text{fluid}$ for each stage (which, in turn, is more accurate). The resulting method has similar accuracy in this benchmark (higher even for the larger time step) and the same stability, but the number of sub-steps is close to the step-coupling method, whereas the restart methods require 2-3x more sub-steps.
Note that a larger ratio between fluid and structure time step further reduces the overhead of the non-restart methods compared to the step-coupling method.
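
The sub-step counts discussed above can be illustrated with a toy model in Python. This is a rough sketch under stated assumptions, not the PR's implementation: the structure uses a fixed sub-step size, `ratio` is the fluid time step divided by the structure sub-step, every sub-integration interval is rounded up to whole sub-steps, and the abscissae `c` are hypothetical values chosen for illustration. All function names are made up for this example.

```python
from fractions import Fraction
from math import ceil


def substeps_step_level(ratio):
    """One sub-integration over the full fluid step."""
    return ceil(ratio)


def substeps_restart(c, ratio):
    """Each stage restarts at t_n and integrates to its stage time; a final
    sub-integration covers the full step from t_n to t_{n+1}."""
    return sum(ceil(c_i * ratio) for c_i in c) + ceil(ratio)


def substeps_non_restart(c, ratio):
    """Each stage continues from the previous stage time; the last segment
    runs from the final stage time to t_{n+1}. Requires non-decreasing c."""
    times = [Fraction(0)] + list(c) + [Fraction(1)]
    return sum(ceil((b - a) * ratio) for a, b in zip(times, times[1:]))


# Hypothetical abscissae of a five-stage scheme (exact fractions avoid rounding issues).
c = [Fraction(0), Fraction(15, 100), Fraction(37, 100), Fraction(62, 100), Fraction(96, 100)]

for ratio in (10, 100):
    print(ratio, substeps_step_level(ratio), substeps_restart(c, ratio), substeps_non_restart(c, ratio))
```

In this toy model, the restart variant needs roughly three times as many sub-steps as step-level coupling, while the overhead of the non-restart variant comes only from rounding each segment up and therefore shrinks as the ratio grows. The numbers are not meant to reproduce the measured sub-step counts in the table.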

In summary, the non-restart stage-coupling method with prediction is more stable (allows for a 2x larger fluid time step), more accurate, and not more expensive (does not require significantly more structure sub-steps) than the previous implementation of step-level coupling. For testing purposes, I added kwargs, so all methods can still be tested. I am not sure if the non-restart will work as well with every time integrator as it did with Carpenter-Kennedy.

Copilot AI left a comment

Pull request overview

This PR implements stage-level coupling for split integration of TotalLagrangianSPHSystems, and refactors the ODE problem parameter p to carry both the semidiscretization and split-integration runtime payload.

Changes:

  • Change semidiscretize/ODE p payload from Semidiscretization to a NamedTuple with p.semi plus p.split_integration_data.
  • Extend SplitIntegrationCallback with stage_coupling / predict_positions options and stage-time integration support.
  • Update callbacks/IO/visualization code paths to use integrator.p.semi / sol.prob.p.semi.

Reviewed changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 2 comments.

| File | Description |
|---|---|
| test/callbacks/info.jl | Update mock integrator payload to match p.semi access pattern. |
| src/visualization/recipes_plots.jl | Adapt plotting recipe and solution type alias to the new p payload shape. |
| src/io/io.jl | Read semidiscretization/metadata from integrator.p.semi. |
| src/general/semidiscretization.jl | Construct p=(; semi, split_integration_data=nothing) and adjust kick!/drift! signatures accordingly. |
| src/callbacks/update.jl | Use integrator.p.semi. |
| src/callbacks/stepsize.jl | Use integrator.p.semi; broaden callback type check for parametric SplitIntegrationCallback. |
| src/callbacks/steady_state_reached.jl | Use integrator.p.semi. |
| src/callbacks/split_integration.jl | Implement stage coupling + new payload storage under p.split_integration_data. |
| src/callbacks/solution_saving.jl | Use integrator.p.semi. |
| src/callbacks/post_process.jl | Use integrator.p.semi. |
| src/callbacks/info.jl | Use integrator.p.semi. |
| src/callbacks/density_reinit.jl | Use integrator.p.semi. |
| ext/TrixiParticlesOrdinaryDiffEqExt.jl | Adjust extension code to read p.semi. |
| docs/src/refs.bib | Remove JabRef metadata comments. |

@codecov

codecov bot commented Mar 2, 2026

Codecov Report

❌ Patch coverage is 93.46405% with 10 lines in your changes missing coverage. Please review.
✅ Project coverage is 89.52%. Comparing base (3afa38f) to head (df670ab).
⚠️ Report is 1 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/callbacks/split_integration.jl | 92.03% | 9 Missing ⚠️ |
| src/visualization/recipes_plots.jl | 0.00% | 1 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1049      +/-   ##
==========================================
- Coverage   89.54%   89.52%   -0.03%     
==========================================
  Files         127      127              
  Lines        9654     9710      +56     
==========================================
+ Hits         8645     8693      +48     
- Misses       1009     1017       +8     
Flag Coverage Δ
total 89.52% <93.46%> (-0.03%) ⬇️
unit 67.30% <15.23%> (-0.32%) ⬇️

Copilot AI left a comment

Pull request overview

Copilot reviewed 14 out of 14 changed files in this pull request and generated 2 comments.


Copilot AI left a comment

Pull request overview

Copilot reviewed 16 out of 16 changed files in this pull request and generated 2 comments.


Copilot AI left a comment

Pull request overview

Copilot reviewed 16 out of 16 changed files in this pull request and generated 1 comment.


@efaulhaber efaulhaber marked this pull request as ready for review March 2, 2026 13:38
@efaulhaber efaulhaber requested review from LasNikas and svchb March 2, 2026 13:38
@efaulhaber
Member Author

/run-gpu-tests

@efaulhaber efaulhaber marked this pull request as draft March 2, 2026 17:20
@efaulhaber efaulhaber force-pushed the stage-coupling branch 2 times, most recently from c798ff9 to d749eb3 March 3, 2026 13:36
@efaulhaber
Member Author

@copilot Find the issue with the failing CI runs and suggest a fix.

Copilot AI left a comment

Pull request overview

Copilot reviewed 23 out of 23 changed files in this pull request and generated 4 comments.


@efaulhaber
Member Author

/run-gpu-tests

@efaulhaber efaulhaber marked this pull request as ready for review March 4, 2026 09:12
@svchb
Collaborator

svchb commented Mar 5, 2026

Since you are changing the extension, can you please also address issue #1047?

@efaulhaber
Member Author

/run-gpu-tests

@efaulhaber efaulhaber requested review from LasNikas and svchb March 31, 2026 08:30
Comment on lines +198 to +202
# Tell OrdinaryDiffEq that `u` has NOT been modified.
# Theoretically, the TLSPH part has been modified, but in the FSAL case,
# the time at the last stage is the same as the step time, so the split integration
# above is skipped and `u` is not modified at all.
# Therefore, the derivative at the last stage can be reused for the next step.
Collaborator

# Tell OrdinaryDiffEq that `u` was not modified.
# In the FSAL case, the last stage occurs at the step time,
# so the split integration is skipped and `u` remains unchanged.
# Therefore, the derivative at the last stage can be reused for the next step.

Comment on lines +277 to +279
# We modify `v_ode` and `u_ode`, which is technically not allowed during stages,
# hence there are no guarantees about the structure part of `v_ode` and `u_ode`.
# By copying the current split integration values, we make sure that it's correct.
Collaborator

unclear

foreach_system(semi_split) do system
# Construct string for the interactions timer.
# Avoid allocations from string construction when no timers are used.
# TODO do we need to disable timers in split integration?
Collaborator

todo?

Labels

enhancement, high priority
