Write a CUDA implementation of the op. One option is to wrap an existing library such as pyRMSD: https://github.com/victor-gil-sepulveda/pyRMSD