Performance improvements to heatmap path#4549
BioTurboNick wants to merge 6 commits into JuliaPlots:master
Conversation
Codecov Report — Base: 90.96% // Head: 90.93% // Decreases project coverage by -0.03%
Additional details and impacted files:

```
@@            Coverage Diff             @@
##           master    #4549      +/-   ##
==========================================
- Coverage   90.96%   90.93%   -0.03%
==========================================
  Files          40       40
  Lines        7744     7753       +9
==========================================
+ Hits         7044     7050       +6
- Misses        700      703       +3
```
Any idea why colorbars use a different extrema logic than the spatial axis limits? @t-bltg Colorbars ignore NaNs, axis limits ignore Infs. EDIT: Infs break the colorbar display, so I'm thinking it should skip them too.
Clueless :) If the current test suite passes with your modifications, I'd say go for it.
It would be nice to fix #2410 along the way ;)
I've tried immutable Extrema objects and other organizations; I often end up losing some optimization the compiler is doing in the current state of this PR. I've poked around at this quite a bit, and I think we're up against the minimum possible time if we hold these invariants:

It feels ridiculous that the vast majority of computation time, for large numbers of large series, is spent finding extrema. Idea: provide an option to control extrema-finding.

Sampling would maybe become the default, taking in all data up to a threshold and then sampling the rest. Strided sampling would likely be more performant and deterministic than random sampling, though there's a slight danger of a patterned input missing important values. Thoughts?
Agreed. Traversing large series can be expensive, but it is sometimes needed. Providing an option is indeed a good idea, since OP's problem is a bit of a niche one. OP provided another case, which would be worth investigating (are the slow paths the same here?):

```julia
x = randn(100, 1_000)
y = randn(100, 1_000)
z = randn(100, 1_000)
scatter(x, y, zcolor=z)                 # slowish
scatter(vec(x), vec(y), zcolor=vec(z))  # fast ?
```

You can maybe find inspiration from #4536, which uses a deterministic "distance to series" algorithm for computing the legend position.
Yes, it's almost entirely in paths that affect all plots: updating colorbar limits and axis extents. If I simply say "take the first 1024 points, then sample ~1024 more in stride" for both, I drastically cut the computation time.
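A minimal sketch of that sampling scheme (hypothetical helper name and parameters for illustration, not the PR's actual code):

```julia
# Hypothetical strided-sampling extrema; `threshold` and the stride choice
# are assumptions, not the PR's actual parameters.
function sampled_extrema(x::AbstractVector{<:Real}; threshold::Int = 1024)
    n = length(x)
    n <= 2threshold && return extrema(x)     # small input: exact answer
    lo, hi = extrema(view(x, 1:threshold))   # take the first `threshold` points
    stride = max(1, (n - threshold) ÷ threshold)
    for i in (threshold + 1):stride:n        # then sample ~`threshold` more in stride
        v = x[i]
        lo = min(lo, v)
        hi = max(hi, v)
    end
    return lo, hi
end
```

The trade-off is the one noted above: a strided sample is deterministic and cheap, but a patterned input can hide its true extrema between sample points.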
I noticed that during (I think?) the last stage of series recipes, the data is copied out, and infinite values and missings are replaced with NaNs at the same time. So a couple of notes:
Of course.
I agree that this is the way to go: store some minimal statistics about a series in a struct or a NamedTuple, e.g. extrema, has/has no NaNs, has/has no Infs, ...
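That per-series summary might be sketched like this (hypothetical names, assuming one pass over the data at ingestion time; not actual Plots.jl code):

```julia
# Hypothetical one-pass summary of a series: finite extrema plus
# flags recording whether any NaNs or Infs were seen.
struct SeriesStats{T<:AbstractFloat}
    lo::T
    hi::T
    has_nan::Bool
    has_inf::Bool
end

function SeriesStats(x::AbstractVector{T}) where {T<:AbstractFloat}
    lo, hi = typemax(T), typemin(T)
    has_nan = has_inf = false
    for v in x
        if isnan(v)
            has_nan = true
        elseif isinf(v)
            has_inf = true
        else
            lo = min(lo, v)   # only finite values feed the extrema
            hi = max(hi, v)
        end
    end
    return SeriesStats{T}(lo, hi, has_nan, has_inf)
end
```

Computing this once when data enters the pipeline would let colorbar and axis-limit code reuse the stored extrema instead of re-traversing the series.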
Partially addresses #4520
Type inference is often a big issue here. EDIT: Looks like the switch to `foreach` in that method in the master branch actually sidestepped the inference issue, but the annotation is probably for the best.

Noticed that `expand_extrema` was constructing and discarding a vector of all heatmap square edges when it just needed the limits.

As the number of series grows, a lot of allocations are spent constructing color gradients, even when they're all the same; I changed it to look up the original object instead.
A big time sink is having to check for infinite values when calculating the extrema. If I remove that check:
Borrowing the approach from NaNMath's functions, we can get this closer:
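The NaNMath trick is to skip bad values with a cheap branch inside a single loop rather than allocating a filtered copy; extended to also skip Infs, a sketch (not the PR's actual implementation) might look like:

```julia
# Sketch of a single-pass extrema that skips NaNs and ±Inf, in the spirit
# of NaNMath's branch-and-skip loops; not the PR's actual code.
function finite_extrema(x::AbstractVector{<:AbstractFloat})
    lo, hi = Inf, -Inf
    @inbounds for v in x
        isfinite(v) || continue      # one branch rejects NaN and ±Inf
        lo = ifelse(v < lo, v, lo)   # branchless updates keep the loop tight
        hi = ifelse(v > hi, v, hi)
    end
    return lo, hi
end
```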
The `NaNMath.extrema` function (used by the colorbar limits) is also pretty slow; replacing it with `minimum(x), maximum(x)` we get:
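One caveat worth flagging: Base's `minimum`/`maximum` propagate NaNs, while NaNMath's skip them, so the swap is only equivalent when NaNs have already been replaced or filtered out of the data:

```julia
x = [1.0, NaN, 3.0]
extrema(x)                  # (NaN, NaN): Base's min/max propagate NaN
extrema(filter(!isnan, x))  # (1.0, 3.0): the NaNMath-like result
```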