Description
The wrappers in https://github.com/rapidsai/raft/blob/branch-23.06/cpp/include/raft/core/math.hpp delegate work to the appropriate CUDA intrinsic. However, the CUDA intrinsics for square root, trigonometric functions, etc. have different names for `__half` (e.g. `hsqrt` instead of `sqrt`).
To facilitate templated code, we could add overloads of `sqrt` calling `hsqrt` on the device, and similarly for the trigonometric functions.
On host, we can either not define those, or use the fp32 functions.
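
To make the proposal concrete, here is a minimal sketch of what such an overload could look like. It is not the actual RAFT code: the `sketch` namespace, the `SKETCH_HD` macro, and the `half_t`, `to_half`, and `to_float` helpers are hypothetical names introduced only for illustration. It assumes the second host option above (falling back to fp32). When compiled without nvcc, a stand-in struct replaces `__half` so the overload pattern can still be compiled and checked on the host.

```cpp
#include <cassert>
#include <cmath>

#if defined(__CUDACC__)
#include <cuda_fp16.h>
#define SKETCH_HD __host__ __device__
#else
#define SKETCH_HD
#endif

namespace sketch {  // hypothetical namespace, not raft::

#if defined(__CUDACC__)
using half_t = __half;
SKETCH_HD inline float to_float(half_t h) { return __half2float(h); }
SKETCH_HD inline half_t to_half(float f) { return __float2half(f); }
#else
// Host-only stand-in so the sketch compiles without the CUDA toolkit.
// It stores a float and does NOT model fp16 rounding.
struct half_t {
  float value;
};
inline float to_float(half_t h) { return h.value; }
inline half_t to_half(float f) { return half_t{f}; }
#endif

// Generic wrapper for float and double.
template <typename T>
SKETCH_HD inline T sqrt(T x)
{
  return std::sqrt(x);
}

// Overload for half precision: picked over the template for half_t arguments.
SKETCH_HD inline half_t sqrt(half_t x)
{
#if defined(__CUDA_ARCH__)
  return hsqrt(x);  // device: half-precision intrinsic from cuda_fp16.h
#else
  return to_half(std::sqrt(to_float(x)));  // host: fp32 fallback
#endif
}

// Templated caller that no longer needs to know which intrinsic to use.
template <typename T>
SKETCH_HD inline T root(T x)
{
  return sketch::sqrt(x);
}

// Small host-side check of the dispatch.
inline void self_test()
{
  assert(root(9.0) == 3.0);
  assert(to_float(root(to_half(4.0f))) == 2.0f);
}

}  // namespace sketch
```

The same pattern would apply to `sin`/`hsin`, `cos`/`hcos`, `exp`/`hexp`, and `log`/`hlog`. If we prefer not to define the host path at all, the `#else` branch could be replaced with a `static_assert` or simply omitted by marking the overload `__device__` only.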