Sorry for the earlier mess; I thought these parts would not be published as part of the package.
The following changes have been made:
- The .so file is now uploaded to a gist as an artifact, so there is no longer any binary in the repo.
- I moved all the files into the folders src, test, and benchmark.
- The scripts used for the benchmarks are included, including the fallback implementation in CUDA.jl. However, I found something strange: CUDA.@sync does not seem to work when calling a function from a .so library, so I was unable to benchmark our code in Julia.
The new benchmark results are shown here:

Originally posted by @ArrogantGao in #1 (comment)