The acceleration is currently done via Numba, which is a heavy run-time dependency.
It may be worth investigating a re-implementation in Cython to:
- see whether we can get any more speed (worth it if we can!)
- turn a run-time dependency into a build-time dependency (might be worth it even at equal speed)
It might also be worth looking into writing the hot code in C or C++ and wrapping it, using either pybind11 or a direct C extension.
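
Since the decision above hinges on relative speed, a small timing harness could make the comparison concrete before any port is started. The following is only a sketch: `best_time`, `compare`, `kernel_loop` and `kernel_builtin` are hypothetical names, and the two kernels are pure-Python stand-ins for whatever Numba, Cython or C/C++ implementations end up being compared.

```python
import timeit


def best_time(func, *args, repeat=5, number=10):
    """Return the best per-call wall time in seconds for func(*args)."""
    # Warm-up call: for a JIT-compiled function (e.g. Numba) the first call
    # includes compilation, which should not be counted as run time.
    func(*args)
    timer = timeit.Timer(lambda: func(*args))
    return min(timer.repeat(repeat=repeat, number=number)) / number


def compare(candidates, *args, **kwargs):
    """Time every candidate on the same input; return {name: seconds}, fastest first."""
    timings = {
        name: best_time(func, *args, **kwargs)
        for name, func in candidates.items()
    }
    return dict(sorted(timings.items(), key=lambda item: item[1]))


# Hypothetical stand-in kernels; replace with the real implementations.
def kernel_loop(values):
    total = 0.0
    for v in values:
        total += v * v
    return total


def kernel_builtin(values):
    return sum(v * v for v in values)


if __name__ == "__main__":
    data = [float(i) for i in range(10_000)]
    candidates = {"loop": kernel_loop, "builtin": kernel_builtin}
    for name, seconds in compare(candidates, data).items():
        print(f"{name}: {seconds * 1e6:.1f} us per call")
```

Taking the minimum over several repeats reduces noise from other processes. It would also be worth checking that all candidates return the same result on the same input before comparing their timings.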