The current implementation of cuda::atomic uses inline PTX to implement the various atomic operations. Using inline PTX for these operations has historically led to poor codegen because the optimizer treats inline PTX as a black box.
Ideally, we'd like to update the atomic implementation to use the new NVVM intrinsics that were added in CUDA 12.8.