Member:
It looks like there is a whole body of literature on fast softmax approximations. I only read through https://arxiv.org/abs/2111.10770v1, but it has a nice list of prior art. Also of interest may be existing CPU-optimized softmax implementations like oneDNN.
Member (Author):
Hadn't looked, but I'm not surprised there's a literature by now! IIRC we dropped the NVidia one as it was slower than NNlib's. The immediate goal, though, is to make this part small compared to the matmul & permutations. Going by my timings here, this gets us from roughly 50% down to 10%.
This defines a `fast_softmax` which uses a low-accuracy `fast_exp`. It's about 5x faster on CPU.

On a GPU, the low-accuracy `exp` isn't faster at all. For small arrays, `fast_softmax` is faster, because it skips the `all(isfinite, max_)` check & thus avoids synchronisation. Thus FluxML/NNlibCUDA.jl#63 should get all the benefit.

The alternative on CPU is to make an `Array` specialisation using LoopVectorization. That's not as quick as this `fast_exp` (about 2x slower for me), but it gives several more digits of precision. This `fast_exp` is roughly Float16 precision; do we want that?
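For a sense of what a low-accuracy `fast_exp` can look like, here is a minimal sketch (not the code in this PR) using the well-known Schraudolph-style bit trick: it writes the Float32 bit pattern of 2^(x/log 2) directly, trading several digits of accuracy for speed, roughly the Float16-level precision mentioned above. The constants and the softmax wrapper below are illustrative assumptions, not taken from this PR.

```julia
# Illustrative sketch -- NOT this PR's exact code -- of a Schraudolph-style
# low-accuracy exp and a softmax built on it. Accuracy is only a few decimal
# digits, roughly Float16-level.

# exp(x) = 2^(x / log(2)); build that power of two straight into the Float32 bits:
# x * 2^23/log(2) lands in the exponent field, 127 * 2^23 adds the exponent bias
# (here folded into one constant together with a small error-reducing correction).
@inline function fast_exp(x::Float32)
    t = muladd(x, 12102203.0f0, 1064866816.0f0)   # ≈ x * 2^23/log(2) + bias - correction
    t = clamp(t, 0.0f0, 2139095040.0f0)           # keep a valid, non-negative bit pattern
    return reinterpret(Float32, unsafe_trunc(Int32, t))
end

# Softmax along `dims`, subtracting the max first so every argument to fast_exp is ≤ 0.
function fast_softmax(x::AbstractArray{Float32}; dims = 1)
    m = maximum(x; dims = dims)
    y = fast_exp.(x .- m)
    return y ./ sum(y; dims = dims)
end
```

The clamp is there so very negative inputs map to 0 and overflow maps to Inf, rather than reinterpreting a garbage bit pattern; since the softmax subtracts the column maximum first, only the lower end matters in practice.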