Improve performance #1
Oooh, I see. Unfortunately, the patch in this PR is not generic enough since, in general, we can't assume any structure outer/left to

    xs |> Filter(x -> x > 0) |> Map(type_instability) |> OptimizeInner() |> Map(asint)

(But it does help me understand the problem. Thanks!)

I'm not sure what's the best strategy, though. I think we need something like

    @please_inline Transducers.next(rf::R_{OptimizeXF}, acc, @nospecialize(input)) = ...

in the Julia compiler to fully solve the problem; i.e., the compiler inlines this even though

Meanwhile, maybe I should stop trying to support

    (type_instability(x) for x in xs) |> Map(asint)
    #                                    ----------
    #                                      JIT'ed
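To make the discussion concrete, here is a minimal Base-only sketch of the kind of type instability being discussed. The names `type_instability` and `asint` come from the thread, but their definitions are not shown there, so the bodies below are assumptions; `VALUES` and `total` are hypothetical helpers, and no Transducers.jl API is used.

```julia
# Hypothetical stand-ins: the thread does not show these definitions.
const VALUES = Any[1, 2.0, 0x03]            # container with element type Any
type_instability(i) = VALUES[mod1(i, 3)]    # inferred return type is Any

asint(x::Integer) = Int(x)                  # methods for concrete inputs
asint(x::AbstractFloat) = round(Int, x)

# The compiler cannot know the type of `x` inside the loop, so the
# method of `asint` is selected from the runtime type of `x`.
function total(n)
    acc = 0
    for i in 1:n
        x = type_instability(i)
        acc += asint(x)
    end
    return acc
end

println(Base.return_types(type_instability, (Int,)))
println(total(3))
```

The point of the sketch is only that everything downstream of the unstable call sees a value of unknown type; whether that call gets inlined or compiled separately is exactly what the comment above is about.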
Yeah, I had the fear that simply eliminating that call is not the way to go... About the compiler support: I struggle a lot with inlining, and being able to force it would give measurable performance improvements in the original target of Catwalk.jl, but I was never brave enough to ask for it... Forcing inlining from the call site seems a bit less risky in terms of accidental compilation overhead, and now we have a real use case. Do you think it is time to open an issue?
My guess is that many Julia programmers have wished for a forced/more controllable inlining macro at least once. I couldn't find it in the issue tracker, though, which is kinda strange. Maybe everyone assumed there is already one 😄. So yeah, I think it'd be nice to have an issue for this.
Great news, @tkf: JuliaLang/julia#41328 allows forced inlining! I have tested this case (only on non-folds, non-catwalk sample code for now; I have package installation issues after compiling 1.8-dev).
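For readers who have not seen the feature, a minimal sketch of call-site inlining as available from Julia 1.8 could look like the following. The functions `add_one` and `total` are hypothetical and are not taken from Catwalk.jl or Transducers.jl.

```julia
# Requires Julia 1.8 or later for call-site annotations.
# The definition asks the compiler NOT to inline this function.
@noinline add_one(x) = x + 1

function total(xs)
    acc = 0
    for x in xs
        # The annotation at the call site takes precedence over the
        # annotation on the definition, so this call is inlined.
        acc += @inline add_one(x)
    end
    return acc
end

println(total(1:3))
```

Because the request is made by the caller, only the call sites that need it pay any extra compilation cost, which matches the "less risky" argument made earlier in the thread.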
Thanks! Yeah, that's great news, esp. for packages that heavily depend on higher-order functions, like Transducers.
A possible fix for the missing performance gain.
The problem was that
was called before the Catwalked method of next, resulting in a non-jitted dynamic dispatch. I am not sure, though, whether what I did is reasonable in the larger context, but I hope you can fix it based on this.
Also, the default batch size was too small, so I have increased it to 1e6. That may be more than ideal; more tests are needed.
When testing with @btime, the initial overhead should be small, but I see a small amount of compilation in every Catwalked run; that's why the tested runtimes have to be several seconds. I will check that, but I like to test cold runs with @time anyway, because Catwalk adds significant compilation overhead, and not measuring it seems unfair.
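The difference between a cold and a warm measurement can be sketched with Base's `@elapsed` alone, so that no benchmarking package is needed. The function `workload` is a hypothetical stand-in for a Catwalked fold, not code from this PR.

```julia
# Hypothetical workload standing in for the fold under test.
workload(xs) = sum(x -> x * x, xs)

xs = collect(1:1_000_000)

cold = @elapsed workload(xs)   # first call: includes JIT compilation time
warm = @elapsed workload(xs)   # later call: compiled code is reused

println("cold run: ", cold, " s")
println("warm run: ", warm, " s")
```

A tool that reports only the minimum over many samples hides the cold cost, so reporting both numbers side by side gives a fairer picture when the optimization itself adds compilation work.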