I tested removing this and the speedup was huge: 500M/hr without it versus 450M/hr with it on the MI300X GPUs, roughly 11% faster.
So I think we should either remove it or make it an option that is chosen at compile time.
I asked on Slack and no one expressed a strong preference for using it, but several people said it would be nice to keep it if possible.