Avoid adjoint sparse matrix times Diagonal #179
Conversation
ranocha left a comment
Thanks! I guess the reason was that there is a more efficient implementation of the matrix-matrix product for sparse times sparse matrices than for sparse times Diagonal matrices.
Could you please file an issue to check for other places where this pattern occurs, e.g., [...] (maybe?) etc.? We can nevertheless merge this PR for now and release a new version including this fix.
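For reference, a minimal, self-contained sketch of the pattern to look for (the names `A`, `d`, and `N` are placeholders, not identifiers from DispersiveShallowWater.jl; the timings depend on the SparseArrays.jl version):

```julia
using LinearAlgebra, SparseArrays, BenchmarkTools

# Placeholder setup: a sparse operator and a diagonal scaling.
N = 1000
A = sprand(N, N, 5 / N)
d = rand(N)

Ddiag = Diagonal(d)  # lazy Diagonal wrapper
Dspd  = spdiagm(d)   # sparse matrix with the same entries on the diagonal

# At the time of this PR, the adjoint of a sparse matrix times a Diagonal hit a
# slow generic fallback (see the benchmarks further down in this thread), while
# multiplying by the spdiagm matrix stays on the sparse-times-sparse path.
@btime $A' * $Ddiag
@btime $A' * $Dspd
@assert A' * Ddiag ≈ A' * Dspd
```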
I don't really understand why the sparse times sparse matrix version should be more efficient than the sparse times Diagonal matrix version, since there is a specialized method. The diagonal sparse matrix has no zeros on the diagonal, so it has the same nnz as the Diagonal. So I would think the sparse times Diagonal version should be as efficient as sparse times sparse, but whatever... I saw that, coincidentally, a PR was merged in SparseArrays.jl just yesterday which addressed Diagonal times sparse matrix products. Maybe this also increased performance for the 3-arg variant.

```julia
julia> using DispersiveShallowWater, LinearAlgebra, SparseArrays, BenchmarkTools
julia> trixi_include(joinpath(examples_dir(), "serre_green_naghdi_1d", "serre_green_naghdi_dingemans.jl"), tspan = (0.0, 1e-10));
[...]
julia> A = sparse(D1.plus);
julia> B = Diagonal(ones(N));
julia> @benchmark A * B
BenchmarkTools.Trial: 10000 samples with 9 evaluations.
Range (min … max): 2.678 μs … 237.673 μs ┊ GC (min … max): 0.00% … 95.07%
Time (median): 3.301 μs ┊ GC (median): 0.00%
Time (mean ± σ): 4.008 μs ± 7.866 μs ┊ GC (mean ± σ): 13.67% ± 7.12%
▄█▆█▆
▂▄██████▄▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▂▂▂▂▁▂▁▂▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▂▂▂▂ ▃
2.68 μs Histogram: frequency by time 10 μs <
Memory estimate: 36.32 KiB, allocs estimate: 10.
julia> @benchmark A' * B
BenchmarkTools.Trial: 2560 samples with 1 evaluation.
Range (min … max): 1.930 ms … 3.540 ms ┊ GC (min … max): 0.00% … 40.94%
Time (median): 1.946 ms ┊ GC (median): 0.00%
Time (mean ± σ): 1.952 ms ± 58.793 μs ┊ GC (mean ± σ): 0.10% ± 1.45%
▃▃▄▆█▆▃▂
▂▃▆██████████▆▅▅▄▄▃▃▃▃▃▃▂▃▃▂▂▂▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▂▂ ▃
1.93 ms Histogram: frequency by time 2.03 ms <
Memory estimate: 136.45 KiB, allocs estimate: 15.
```

This seems like a missing feature in SparseArrays.jl. I guess the "proper" solution for this would be adding a method for the adjoint sparse times Diagonal matrix product (a rough sketch of what such a method might look like is given after the benchmark results below).

```julia
julia> result["serre_green_naghdi_1d/serre_green_naghdi_soliton_upwind.jl"]
BenchmarkTools.Trial: 2413 samples with 1 evaluation.
Range (min … max): 2.020 ms … 4.884 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 2.041 ms ┊ GC (median): 0.00%
Time (mean ± σ): 2.070 ms ± 190.473 μs ┊ GC (mean ± σ): 0.95% ± 4.41%
██▄▂▁
██████▄▆▆▃▁▁▁▃▁▁▁▁▁▁▁▁▁▁▃▁▃▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▄▁▄▅▅▅▄▅ █
2.02 ms Histogram: log(frequency) by time 3.17 ms <
Memory estimate: 553.49 KiB, allocs estimate: 82.
```

with [...]:

```julia
julia> result["serre_green_naghdi_1d/serre_green_naghdi_soliton_upwind.jl"]
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 93.304 μs … 5.776 ms ┊ GC (min … max): 0.00% … 34.14%
Time (median): 101.072 μs ┊ GC (median): 0.00%
Time (mean ± σ): 118.737 μs ± 149.333 μs ┊ GC (mean ± σ): 9.74% ± 7.95%
█▃ ▃ ▁
██▆▇█▅▅▄▄▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▅ █
93.3 μs Histogram: log(frequency) by time 1.27 ms <
Memory estimate: 542.41 KiB, allocs estimate: 125.
```

with parentheses:

```julia
julia> result["serre_green_naghdi_1d/serre_green_naghdi_soliton_upwind.jl"]
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 85.128 μs … 5.753 ms ┊ GC (min … max): 0.00% … 83.22%
Time (median): 95.589 μs ┊ GC (median): 0.00%
Time (mean ± σ): 115.297 μs ± 167.201 μs ┊ GC (mean ± σ): 10.12% ± 7.22%
▁▄▇███▇▆▆▅▅▅▄▃▃▂▂▁ ▁▁▂▂▂▂▁ ▃
█████████████████████▇▇▆▇▇▇▇▅▆▆▇████████▇█▇▇▇▅▅▇▆▆▅▅▅▆▁▄▃▃▅▁▄ █
85.1 μs Histogram: log(frequency) by time 199 μs <
Memory estimate: 489.81 KiB, allocs estimate: 90.
```

So this means the parentheses solution seems to be the fastest, which also makes sense to me because it has the most structure (in contrast to the [...]).

I double-checked the other sparse times Diagonal matrix products and updated the cases where we had a sparse adjoint times Diagonal. These were not so performance critical, though, because they only appeared in creating the cache, but not in the [...].
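For what it's worth, here is a rough sketch of what such a specialized method could look like. This is only an illustration based on the identity A' * D == (D' * A)', not the actual SparseArrays.jl implementation, and the function name is made up:

```julia
using LinearAlgebra, SparseArrays

# Hypothetical sketch, not the SparseArrays.jl implementation: route the
# adjoint-sparse times Diagonal product through the already specialized
# Diagonal times sparse method using the identity A' * D == (D' * A)'.
function adjoint_sparse_times_diagonal(A::SparseMatrixCSC, D::Diagonal)
    return copy((D' * A)')  # D' * A stays sparse and fast; copy materializes the adjoint
end
```

For real matrices this reduces to `copy((D * A)')`, i.e., scaling the rows of `A` by the diagonal entries and then transposing.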
Using `Diagonal` slowed down the simulation dramatically as observed in #178. With `spdiagm` this is fixed.

Running serre_green_naghdi_soliton_upwind.jl on main:

```
 ───────────────────────────────────────────────────────────────────────────────────────────
              DispersiveSWE                      Time                    Allocations
                                        ───────────────────────   ────────────────────────
           Tot / % measured:                 3.40s /  99.8%            879MiB /  99.9%

 Section                        ncalls     time    %tot     avg     alloc    %tot      avg
 ───────────────────────────────────────────────────────────────────────────────────────────
 rhs!                            1.62k    3.39s  100.0%  2.09ms    877MiB  100.0%   554KiB
   assembling elliptic operator  1.62k    3.27s   96.4%  2.01ms    584MiB   66.5%   368KiB
   solving elliptic system       1.62k    116ms    3.4%  71.5μs    293MiB   33.4%   185KiB
   hyperbolic terms              1.62k   4.32ms    0.1%  2.66μs     0.00B    0.0%    0.00B
   ~rhs!~                        1.62k    620μs    0.0%   382ns   1.25KiB    0.0%    0.79B
 analyze solution                     3    538μs    0.0%   179μs    148KiB    0.0%  49.3KiB
 ───────────────────────────────────────────────────────────────────────────────────────────
```

This PR (old version with `spdiagm`):

```
 ───────────────────────────────────────────────────────────────────────────────────────────
              DispersiveSWE                      Time                    Allocations
                                        ───────────────────────   ────────────────────────
           Tot / % measured:                 237ms /  97.5%            861MiB /  99.9%

 Section                        ncalls     time    %tot     avg     alloc    %tot      avg
 ───────────────────────────────────────────────────────────────────────────────────────────
 rhs!                            1.62k    229ms   99.3%   141μs    860MiB  100.0%   542KiB
   solving elliptic system       1.62k    118ms   50.9%  72.4μs    293MiB   34.1%   185KiB
   assembling elliptic operator  1.62k    107ms   46.5%  66.1μs    566MiB   65.9%   357KiB
   hyperbolic terms              1.62k   4.10ms    1.8%  2.53μs     0.00B    0.0%    0.00B
   ~rhs!~                        1.62k    515μs    0.2%   317ns   1.25KiB    0.0%    0.79B
 analyze solution                     3   1.52ms    0.7%   506μs    150KiB    0.0%  49.9KiB
 ───────────────────────────────────────────────────────────────────────────────────────────
```

This PR (new version with `Diagonal` and parentheses):

```
 ───────────────────────────────────────────────────────────────────────────────────────────
              DispersiveSWE                      Time                    Allocations
                                        ───────────────────────   ────────────────────────
           Tot / % measured:                 208ms /  96.6%            778MiB /  99.8%

 Section                        ncalls     time    %tot     avg     alloc    %tot      avg
 ───────────────────────────────────────────────────────────────────────────────────────────
 rhs!                            1.62k    201ms   99.8%   124μs    776MiB  100.0%   490KiB
   solving elliptic system       1.62k    116ms   57.8%  71.7μs    293MiB   37.8%   185KiB
   assembling elliptic operator  1.62k   79.5ms   39.5%  49.0μs    483MiB   62.2%   305KiB
   hyperbolic terms              1.62k   4.20ms    2.1%  2.59μs     0.00B    0.0%    0.00B
   ~rhs!~                        1.62k    703μs    0.3%   433ns   1.25KiB    0.0%    0.79B
 analyze solution                     3    487μs    0.2%   162μs    148KiB    0.0%  49.3KiB
 ───────────────────────────────────────────────────────────────────────────────────────────
```

Fixes #178.
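For completeness, a hypothetical illustration of the kind of difference the parentheses make (`P` and `d` are placeholders, not the operators assembled in the package; the point is only which product is formed first):

```julia
using LinearAlgebra, SparseArrays, BenchmarkTools

# Placeholder operators, not the ones from the package.
N = 1000
P = sprand(N, N, 5 / N)
d = Diagonal(rand(N))

# The first grouping forms the adjoint-sparse times Diagonal product, which hit
# the slow generic fallback discussed above; the second grouping forms the
# specialized Diagonal times sparse product first, so only sparse products remain.
@btime ($P' * $d) * $P
@btime $P' * ($d * $P)
@assert (P' * d) * P ≈ P' * (d * P)
```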