Add inline and inbounds annotations to unsafe_dot (#639)
Conversation
Codecov Report

All modified and coverable lines are covered by tests ✅

```
@@           Coverage Diff           @@
##           master     #639   +/-   ##
=======================================
  Coverage   98.13%   98.13%
=======================================
  Files          19       19
  Lines        3277     3277
=======================================
  Hits         3216     3216
  Misses         61       61
```

☔ View full report in Codecov by Sentry.
Any idea how much difference the `@inline` and the `@inbounds` each make?
It is mostly the inline. Rerunning the benchmark:

Without the `@inbounds`:

```
[benchmark histogram]
Memory estimate: 16 bytes, allocs estimate: 1.
```

Without the `@inline`:

```
julia> @benchmark filt!(output, interp_fil, data)
[benchmark histogram]
Memory estimate: 16 bytes, allocs estimate: 1.
```
Thanks. In my opinion, overriding the compiler heuristics needs good justification, and those benchmarks surely qualify. The additional […]

But if this helps only interpolation, maybe we can use call-site inlining for just that?
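The call-site inlining suggested above can be sketched as follows. This is a minimal illustration, not the PR's actual code: `naive_dot` and `interp_step` are hypothetical stand-ins for `unsafe_dot` and the interpolation path, and call-site `@inline` assumes Julia 1.8 or later.

```julia
# Hypothetical stand-in for unsafe_dot: a plain dot-product kernel
# with no @inline annotation on the definition itself.
naive_dot(a, b) = sum(a[i] * b[i] for i in eachindex(a, b))

# Hypothetical stand-in for the interpolation hot path.
function interp_step(a, b)
    # Call-site @inline (Julia ≥ 1.8) overrides the compiler's inlining
    # heuristic for this call only; other call sites keep the default
    # cost model, so no other users of the function are affected.
    return @inline naive_dot(a, b)
end
```

This keeps the definition unannotated while still forcing inlining where the benchmarks showed it matters.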
Good point. Can you try that, @jacobga1998?

Seems to give the same benchmarks; I uploaded it.
wheeheee left a comment:
Just a style preference: I like the `@inline` macro right beside the function call. Other than that, LGTM. Also, a little benchmarking suggests that `BLAS.dot` is slower than the Julia version for small kernel lengths, but that's a separate issue and there is no urgent need to change it.
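The small-kernel comparison mentioned above can be illustrated like this. It is a rough sketch, not the DSP.jl code: `loop_dot` is a hypothetical hand-written kernel, and the point is only that a plain Julia loop avoids the fixed overhead of a BLAS call when the vectors are short.

```julia
using LinearAlgebra  # for BLAS.dot

# Hand-written dot product; for short vectors this tends to beat
# BLAS.dot because it has no library-call overhead.
function loop_dot(a, b)
    s = zero(eltype(a))
    @inbounds @simd for i in eachindex(a, b)
        s += a[i] * b[i]
    end
    return s
end

a = randn(8); b = randn(8)           # a "small kernel"-sized pair
loop_dot(a, b) ≈ BLAS.dot(length(a), a, 1, b, 1)  # same result; only timing differs
```

Timing the two with BenchmarkTools at various lengths would show where the crossover lies, but as noted, that is a separate issue from this PR.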
Co-authored-by: wheeheee <104880306+wheeheee@users.noreply.github.com>
Agreed, that looks nicer. Added the suggestions.
Hello,

I was using a FIR interpolation filter and found that it was a bottleneck in my code. While profiling I realized that the compiler wasn't inlining the `unsafe_dot` function, and that the first multiplication was missing an `@inbounds`. In the benchmark below, the execution time after adding `@inline` and `@inbounds` is about half. I also tried longer filters and decimation kernels and didn't find any difference in execution time, but I added the annotations to the other functions that did not already have them, in case there is a scenario where it matters.

Let me know if you think this is worth merging.
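The change described above can be sketched roughly as follows. This is a hypothetical reconstruction, not the exact DSP.jl source: the signature and indexing of `unsafe_dot` are assumptions made for illustration, and the key points are the `@inline` on the definition and the `@inbounds` now also covering the first multiplication.

```julia
# Hypothetical sketch of an annotated dot-product kernel: multiplies the
# coefficients `a` against the window of `b` ending at index `bLastIdx`.
# Previously the first multiplication sat outside @inbounds and paid a
# bounds check; the loop and the first product are now both covered.
@inline function unsafe_dot(a::AbstractVector, b::AbstractVector, bLastIdx::Integer)
    off = bLastIdx - length(a)
    @inbounds begin
        dotprod = a[1] * b[off + 1]
        @simd for i in 2:length(a)
            dotprod += a[i] * b[off + i]
        end
    end
    return dotprod
end
```

As with any `@inbounds` use, callers must guarantee `bLastIdx ≥ length(a)` and `bLastIdx ≤ length(b)`, which is why the function carries the `unsafe_` prefix.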
Before PR (Julia 1.11)
After PR