Flux agnostic @generated 3D conservative volume turbo kernel #3090

Open
MarcoArtiano wants to merge 33 commits into main from ma/generated_turbo
@MarcoArtiano MarcoArtiano commented Jun 18, 2026

Contributor

After some discussion with @ranocha, it would be nice to have a generic volume turbo kernel that does not require copy-pasting the whole machinery and specializing it for each flux. In TrixiAtmo, that process would be quite tedious. @ranocha suggested looking at @generated functions, since we need to unroll loops by hand over a generic number of variables (nvariables) and a generic number of precomputed variables. We therefore need a kernel that generates the equivalent of the hand-written code in dg_compressible_euler_3d but is generic in these two quantities.
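As a rough illustration of why @generated functions fit this need, the sketch below unrolls a loop over a tuple whose length is only known from its type. It is a minimal toy, not the kernel from this PR; the name `unrolled_dot` is hypothetical.

```julia
# Hypothetical illustration only; this is not the kernel added in this PR.
# The number of entries N is part of the argument types, so the generator
# can emit straight-line code with no loop over the variables at run time.
@generated function unrolled_dot(a::NTuple{N, T}, b::NTuple{N, T}) where {N, T}
    # Inside the generator, N is a concrete integer.
    ex = :(zero(T))
    for v in 1:N
        # Append one term per variable: ... + a[v] * b[v]
        ex = :($ex + a[$v] * b[$v])
    end
    return ex
end

result = unrolled_dot((1.0, 2.0, 3.0), (4.0, 5.0, 6.0))
println(result)
```

The same idea extends to emitting one scalar load and one scalar accumulation per variable, which is the shape of code that a hand-written kernel spells out for a fixed number of variables.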

Claude AI has assisted me in the creation of the PR.

Benchmarks use 16 threads and p4est_3d_dgsem/elixir_euler_ec.jl with tspan = (0.0, 10.0).

Plain turbo flux_ranocha_turbo

BenchmarkTools.Trial: 4 samples with 1 evaluation per sample.
 Range (min … max):  1.572 s …  1.624 s  ┊ GC (min … max): 0.00% … 1.62%
 Time  (median):     1.579 s              ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.589 s ± 24.180 ms  ┊ GC (mean ± σ):  0.41% ± 0.81%

  █    █    █                                             █
  █▁▁▁▁█▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  1.57 s         Histogram: frequency by time        1.62 s <

 Memory estimate: 27.09 MiB, allocs estimate: 45651.

Generated code with FluxVolumeTurbo(flux_ranocha)

BenchmarkTools.Trial: 4 samples with 1 evaluation per sample.
 Range (min … max):  1.596 s …  1.671 s  ┊ GC (min … max): 0.00% … 1.87%
 Time  (median):     1.609 s              ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.621 s ± 35.386 ms  ┊ GC (mean ± σ):  0.48% ± 0.93%

  █                  ▁                                    ▁
  █▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  1.6 s          Histogram: frequency by time        1.67 s <

 Memory estimate: 27.10 MiB, allocs estimate: 45679.

Plain implementation with flux_ranocha (just for completeness)

BenchmarkTools.Trial: 3 samples with 1 evaluation per sample.
 Range (min … max):  2.430 s …  2.554 s  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.432 s              ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.472 s ± 71.102 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █                                                       ▁
  █▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  2.43 s         Histogram: frequency by time        2.55 s <

 Memory estimate: 27.09 MiB, allocs estimate: 45653.

The current implementation also allows any volume flux to be used, although it is not guaranteed that an arbitrary flux will work inside the @turbo macro.
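One way to picture how a wrapper such as FluxVolumeTurbo(flux) can accept any flux is a small callable struct that the solver dispatches on. The sketch below is an assumption about the general pattern, not the type defined in this PR; `ToyTurboWrapper` and `kernel_name` are hypothetical names.

```julia
# Hypothetical wrapper, loosely modeled on the FluxVolumeTurbo(...) usage
# shown in this PR description. It is not the actual implementation.
struct ToyTurboWrapper{F}
    flux::F
end

# Forward ordinary calls to the wrapped flux, so the wrapper can be used
# wherever the plain flux could be used.
(w::ToyTurboWrapper)(args...) = w.flux(args...)

# A solver could dispatch on the wrapper type to select a kernel.
kernel_name(::ToyTurboWrapper) = "generated turbo kernel"
kernel_name(::Any) = "plain kernel"

toy_flux(u_ll, u_rr) = 0.5 * (u_ll + u_rr)
wrapped = ToyTurboWrapper(toy_flux)

println(kernel_name(toy_flux), ": ", toy_flux(1.0, 3.0))
println(kernel_name(wrapped), ": ", wrapped(1.0, 3.0))
```

Because the wrapper is parametric in the flux type, no flux-specific method is needed to opt in, which matches the "flux agnostic" goal stated in the title.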

Plain implementation for flux_shima_etal

BenchmarkTools.Trial: 3 samples with 1 evaluation per sample.
 Range (min … max):  2.028 s …  2.066 s  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.037 s              ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.043 s ± 19.837 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █            █                                          █
  █▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  2.03 s         Histogram: frequency by time        2.07 s <

 Memory estimate: 27.09 MiB, allocs estimate: 45654.

FluxVolumeTurbo(flux_shima_etal). Note that this flux has no specialization in the generated code: it uses the generated kernel, but the primitive variables are not precomputed.

BenchmarkTools.Trial: 3 samples with 1 evaluation per sample.
 Range (min … max):  1.661 s …  1.693 s  ┊ GC (min … max): 0.00% … 0.36%
 Time  (median):     1.662 s              ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.672 s ± 18.437 ms  ┊ GC (mean ± σ):  0.12% ± 0.21%

  █                                                       ▁
  █▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  1.66 s         Histogram: frequency by time        1.69 s <

 Memory estimate: 27.10 MiB, allocs estimate: 45680.

Hand-written turbo implementation with flux_shima_etal_turbo. Here the primitive variables are precomputed, since this uses the pre-existing optimization. The same precomputation can also be specified for the generic generated code.

BenchmarkTools.Trial: 4 samples with 1 evaluation per sample.
 Range (min … max):  1.446 s …  1.473 s  ┊ GC (min … max): 0.00% … 0.60%
 Time  (median):     1.448 s              ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.454 s ± 13.157 ms  ┊ GC (mean ± σ):  0.15% ± 0.30%

  █  ██                                                   █
  █▁▁██▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  1.45 s         Histogram: frequency by time        1.47 s <

 Memory estimate: 27.09 MiB, allocs estimate: 45649.
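
To compare the Euler runs above at a glance, the reported medians can be turned into speedup ratios relative to the plain implementation. The short script below only does that arithmetic; the key names are ad hoc labels, and the numbers are the medians copied from the benchmark output above.

```julia
# Median wall times in seconds, copied from the benchmark output above.
medians = (
    ranocha_plain = 2.432,
    ranocha_generated = 1.609,
    ranocha_hand_turbo = 1.579,
    shima_plain = 2.037,
    shima_generated = 1.662,   # generated kernel, no precomputed variables
    shima_hand_turbo = 1.448,
)

# Speedup relative to a baseline; values above 1 mean faster than baseline.
speedup(baseline, t) = baseline / t

speedups = (
    ranocha_generated = speedup(medians.ranocha_plain, medians.ranocha_generated),
    ranocha_hand_turbo = speedup(medians.ranocha_plain, medians.ranocha_hand_turbo),
    shima_generated = speedup(medians.shima_plain, medians.shima_generated),
    shima_hand_turbo = speedup(medians.shima_plain, medians.shima_hand_turbo),
)

for (name, s) in pairs(speedups)
    println(rpad(string(name), 20), round(s; digits = 2))
end
```

With only three or four samples per trial, these ratios are indicative rather than precise.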

Nonconservative terms benchmarks

The nonconservative implementation has been moved to #3094.
MHD with nonconservative terms for p4est_3d_dgsem/elixir_mhd_alfven_wave_nonperiodic.jl
Plain implementation volume_flux = (flux_hindenlang_gassner, flux_nonconservative_powell)

BenchmarkTools.Trial: 4 samples with 1 evaluation per sample.
 Range (min … max):  1.318 s …  1.366 s  ┊ GC (min … max): 0.00% … 2.25%
 Time  (median):     1.324 s              ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.333 s ± 22.366 ms  ┊ GC (mean ± σ):  0.58% ± 1.13%

  ██           █                                          █
  ██▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  1.32 s         Histogram: frequency by time        1.37 s <

 Memory estimate: 33.32 MiB, allocs estimate: 9460.

FluxVolumeTurbo(flux_hindenlang_gassner, flux_nonconservative_powell); this does not precompute the primitive variables.

BenchmarkTools.Trial: 6 samples with 1 evaluation per sample.
 Range (min … max):  982.983 ms …  1.035 s  ┊ GC (min … max): 0.00% … 2.90%
 Time  (median):     984.219 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   993.311 ms ± 20.522 ms  ┊ GC (mean ± σ):  0.50% ± 1.19%

  ██      ▁                                                  ▁
  ██▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  983 ms          Histogram: frequency by time          1.03 s <

 Memory estimate: 33.35 MiB, allocs estimate: 9512.

Combined approach in p4est_3d_dgsem/elixir_mhd_alfven_wave_combined_fluxes_nonperiodic.jl

BenchmarkTools.Trial: 4 samples with 1 evaluation per sample.
 Range (min … max):  1.226 s …   1.565 s  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.231 s               ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.313 s ± 167.761 ms  ┊ GC (mean ± σ):  0.11% ± 0.23%

  █▁                                                       ▁
  ██▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  1.23 s         Histogram: frequency by time         1.56 s <

 Memory estimate: 33.32 MiB, allocs estimate: 9446.

So the generic code, which in principle accepts any volume flux, already provides a decent speedup without the additional effort of specializing (for example, the median drops from 2.037 s to 1.662 s for flux_shima_etal). If someone is willing to invest the time to write the two small specialized functions (one to precompute the primitive variables, one for the flux in terms of primitive variables), then we reach the same speedup as the pre-existing implementation, with less effort than copy-pasting the whole volume kernel and writing the flux in each direction. In summary, three ingredients need to be specified:

  • number of flux auxiliary variables (or precomputed variables)
  • transformation from cons to flux precomputed variables
  • numerical flux that accepts directly the precomputed variables
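
The three ingredients listed above could look roughly as follows for a toy 1D Euler system. This is a sketch under stated assumptions: all names (`ToyEuler1D`, `n_precomputed`, `cons2precomputed`, `flux_central_precomputed`) are hypothetical and are not the interface added in this PR, and a central flux stands in for an entropy-conservative flux to keep the example short.

```julia
# All names here are hypothetical; they are not the API introduced in this PR.
struct ToyEuler1D
    gamma::Float64
end

# Ingredient 1: number of precomputed (auxiliary) variables per node.
n_precomputed(::ToyEuler1D) = 3

# Ingredient 2: conservative variables (rho, rho*v, rho*e) -> (rho, v, p).
function cons2precomputed(u, equations::ToyEuler1D)
    rho, rho_v, rho_e = u
    v = rho_v / rho
    p = (equations.gamma - 1) * (rho_e - 0.5 * rho_v * v)
    return (rho, v, p)
end

# Physical flux written directly in terms of the precomputed variables.
function physical_flux_precomputed(q, equations::ToyEuler1D)
    rho, v, p = q
    rho_e = p / (equations.gamma - 1) + 0.5 * rho * v^2
    return (rho * v, rho * v^2 + p, (rho_e + p) * v)
end

# Ingredient 3: a two-point flux that accepts precomputed variables directly.
# A central flux is used here only as a simple stand-in.
function flux_central_precomputed(q_ll, q_rr, equations::ToyEuler1D)
    f_ll = physical_flux_precomputed(q_ll, equations)
    f_rr = physical_flux_precomputed(q_rr, equations)
    return 0.5 .* (f_ll .+ f_rr)
end

equations = ToyEuler1D(1.4)
q_ll = cons2precomputed((1.0, 0.5, 2.0), equations)
q_rr = cons2precomputed((0.9, 0.3, 1.8), equations)
println(flux_central_precomputed(q_ll, q_rr, equations))
```

The point of the split is that the conversion in ingredient 2 runs once per node, while the flux in ingredient 3 runs once per node pair and can skip the divisions.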

@github-actions

Contributor

Review checklist

This checklist is meant to assist creators of PRs (to let them know what reviewers will typically look for) and reviewers (to guide them in a structured review process). Items do not need to be checked explicitly for a PR to be eligible for merging.

Purpose and scope

  • The PR has a single goal that is clear from the PR title and/or description.
  • All code changes represent a single set of modifications that logically belong together.
  • No more than 500 lines of code are changed or there is no obvious way to split the PR into multiple PRs.

Code quality

  • The code can be understood easily.
  • Newly introduced names for variables etc. are self-descriptive and consistent with existing naming conventions.
  • There are no redundancies that can be removed by simple modularization/refactoring.
  • There are no leftover debug statements or commented code sections.
  • The code adheres to our conventions and style guide, and to the Julia guidelines.

Documentation

  • New functions and types are documented with a docstring or top-level comment.
  • Relevant publications are referenced in docstrings (see example for formatting).
  • Inline comments are used to document longer or unusual code sections.
  • Comments describe intent ("why?") and not just functionality ("what?").
  • If the PR introduces a significant change or new feature, it is documented in NEWS.md with its PR number.

Testing

  • The PR passes all tests.
  • New or modified lines of code are covered by tests.
  • New or modified tests run in less than 10 seconds.

Performance

  • There are no type instabilities or memory allocations in performance-critical parts.
  • If the PR intent is to improve performance, before/after time measurements are posted in the PR.

Verification

  • The correctness of the code was verified using appropriate tests.
  • If new equations/methods are added, a convergence test has been run and the results
    are posted in the PR.

Created with ❤️ by the Trixi.jl community.

@codecov

codecov Bot commented Jun 18, 2026


Codecov Report

❌ Patch coverage is 93.18182% with 12 lines in your changes missing coverage. Please review.
✅ Project coverage is 96.87%. Comparing base (c9f3657) to head (72547ef).

Files with missing lines | Patch % | Lines
src/auxiliary/math.jl | 7.69% | 12 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3090      +/-   ##
==========================================
- Coverage   96.88%   96.87%   -0.01%     
==========================================
  Files         647      648       +1     
  Lines       50035    50211     +176     
==========================================
+ Hits        48475    48639     +164     
- Misses       1560     1572      +12     
Flag | Coverage Δ
unittests | 96.87% <93.18%> (-0.01%) ⬇️


@MarcoArtiano MarcoArtiano changed the title from "WIP: @generated 3D volume turbo kernel" to "WIP: Flux agnostic @generated 3D volume turbo kernel" Jun 19, 2026
Comment thread src/solvers/dgsem_structured/dg_3d_turbo.jl Outdated
Comment thread src/solvers/dgsem_structured/dg_3d_turbo.jl Outdated
Comment on lines +602 to +603
@inline function volume_flux_turbo(volume_flux::typeof(flux_ranocha_turbo),
have_nonconservative_terms::False,

@MarcoArtiano MarcoArtiano Jun 19, 2026

Contributor Author

I'm not happy with this design choice, as I would like to avoid dispatching on the type of the turbo flux. However, it avoids repeating the can_turbo line for each new turbo flux.

@MarcoArtiano MarcoArtiano changed the title from "WIP: Flux agnostic @generated 3D volume turbo kernel" to "Flux agnostic @generated 3D conservative volume turbo kernel" Jun 20, 2026
@MarcoArtiano MarcoArtiano marked this pull request as ready for review June 20, 2026 08:09
Comment thread src/solvers/dgsem_structured/dg_3d_turbo.jl Outdated
@MarcoArtiano MarcoArtiano requested a review from ranocha June 20, 2026 08:25
Comment thread src/equations/numerical_fluxes.jl Outdated
MarcoArtiano and others added 3 commits June 20, 2026 10:43
Co-authored-by: Marco Artiano <57838732+MarcoArtiano@users.noreply.github.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

@ranocha ranocha left a comment

Member

Thanks! Can you please add a NEWS.md entry as well?

Comment thread src/auxiliary/math.jl Outdated
Comment thread src/auxiliary/math.jl Outdated
Comment thread src/auxiliary/math.jl
Comment thread src/auxiliary/math.jl
Comment thread src/equations/numerical_fluxes.jl
Comment thread src/solvers/dgsem_structured/dg_3d_turbo.jl Outdated
Comment thread src/solvers/dgsem_structured/dg_3d_turbo.jl Outdated
Comment thread src/solvers/dgsem_structured/dg_3d_turbo.jl Outdated
Comment thread test/test_performance_specializations_3d.jl Outdated
Comment thread test/test_performance_specializations_3d.jl Outdated

@ranocha ranocha left a comment

Member

Thanks! Could you please also add a test with FluxTurbo for something where no specialization is implemented, e.g., on a mesh not supported at the moment or with some strange numerical types not supported by LoopVectorization.jl, e.g., BigFloat on a simple and small 1D problem?
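
The kind of fallback such a test would exercise can be sketched as a dispatch on the element type: hardware floats take a vectorized path, everything else (such as BigFloat) takes a plain path. The helper names below are hypothetical, and `@simd` from Base stands in for `@turbo` so that the example needs no packages.

```julia
# Hypothetical helper: only hardware floats take the vectorized path.
# `@simd` is used as a stand-in for LoopVectorization's `@turbo`.
supports_simd(::Type{<:Union{Float32, Float64}}) = true
supports_simd(::Type) = false

function sum_of_squares(x::AbstractVector{T}) where {T}
    if supports_simd(T)
        s = zero(T)
        @inbounds @simd for i in eachindex(x)
            s += x[i]^2
        end
        return s
    else
        # Generic fallback that works for any number type, e.g. BigFloat.
        return sum(abs2, x)
    end
end

println(sum_of_squares([1.0, 2.0, 3.0]))
println(sum_of_squares(BigFloat[1, 2, 3]))
```

A test along these lines would check that both paths agree on the result, so that choosing the wrapper never changes the numerics.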

Comment thread src/Trixi.jl Outdated
Comment thread src/equations/numerical_fluxes.jl
Comment thread src/equations/numerical_fluxes.jl
Comment on lines +623 to +633
@inline function volume_flux_turbo(volume_flux, have_nonconservative_terms::False,
                                   aux_and_normals_and_equations...)
    # Argument layout: u_ll (n scalars), u_rr (n scalars),
    # three normal components, and finally the equations
    equations = last(aux_and_normals_and_equations)
    n = nvariables(equations)
    # Repack the flattened scalars into static vectors
    u_ll = SVector(ntuple(v -> aux_and_normals_and_equations[v], Val(n)))
    u_rr = SVector(ntuple(v -> aux_and_normals_and_equations[n + v], Val(n)))
    normal_direction = SVector(aux_and_normals_and_equations[end - 3],
                               aux_and_normals_and_equations[end - 2],
                               aux_and_normals_and_equations[end - 1])
    # Call the flux with its standard signature
    return volume_flux(u_ll, u_rr, normal_direction, equations)
end

Member

Can you please adapt the "aux" name here as well, e.g., turbovars or something like that for consistency with the other names?
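
For readers following the thread, the repacking done by the fallback quoted above can be reproduced in a package-free toy. Plain tuples replace SVector here, `nvars` is a hypothetical stand-in for nvariables, and the argument is named with the suggested "turbovars" prefix purely for illustration.

```julia
# Toy version of the argument repacking; tuples stand in for SVector.
struct ToyEquations end
nvars(::ToyEquations) = 2   # hypothetical stand-in for nvariables(equations)

# A flux with the standard signature (u_ll, u_rr, normal_direction, equations).
function toy_flux(u_ll, u_rr, normal_direction, ::ToyEquations)
    return 0.5 .* (u_ll .+ u_rr) .* sum(normal_direction)
end

function call_with_repacked_args(flux, turbovars_and_normals_and_equations...)
    args = turbovars_and_normals_and_equations
    # Layout: n left values, n right values, three normal components, equations
    equations = last(args)
    n = nvars(equations)
    u_ll = ntuple(v -> args[v], Val(n))
    u_rr = ntuple(v -> args[n + v], Val(n))
    normal_direction = (args[end - 3], args[end - 2], args[end - 1])
    return flux(u_ll, u_rr, normal_direction, equations)
end

flux_value = call_with_repacked_args(toy_flux, 1.0, 2.0, 3.0, 4.0,
                                     1.0, 0.0, 0.0, ToyEquations())
println(flux_value)
```

Passing scalars and repacking them only in the fallback keeps the flattened calling convention uniform for the generated kernel while still supporting fluxes written against the standard signature.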

Co-authored-by: Hendrik Ranocha <ranocha@users.noreply.github.com>