One possible solution is to run the flagship config on CUDA as a CI check that verifies two things: that SYPD (simulated years per day) stays within a set tolerance, and that the name of the slowest kernel hasn't changed. A change in the slowest kernel would indicate a kernel that should be optimized before merging, unless performance is less important than accuracy and stability at this stage.
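A minimal sketch of what such a check might look like, assuming the CI job has already extracted the SYPD and the slowest kernel name from a profiler run. The names `PerfResult` and `check_performance`, the 5% default tolerance, and the choice to fail only on slowdowns are all hypothetical; nothing here reflects an existing CI setup.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PerfResult:
    """Hypothetical summary of one flagship run on CUDA."""
    sypd: float           # simulated years per day
    slowest_kernel: str   # name of the kernel with the largest total time


def check_performance(baseline: PerfResult, current: PerfResult,
                      rel_tol: float = 0.05) -> list[str]:
    """Return a list of failure messages; an empty list means the check passes."""
    failures = []

    # Assumption: only slowdowns fail the check; speedups are allowed through.
    min_sypd = baseline.sypd * (1.0 - rel_tol)
    if current.sypd < min_sypd:
        failures.append(
            f"SYPD {current.sypd:.3f} is below the allowed minimum {min_sypd:.3f}"
        )

    # A different slowest kernel suggests a new bottleneck was introduced.
    if current.slowest_kernel != baseline.slowest_kernel:
        failures.append(
            f"slowest kernel changed from {baseline.slowest_kernel!r} "
            f"to {current.slowest_kernel!r}"
        )

    return failures
```

In a CI script, a non-empty result would be printed and turned into a non-zero exit code. The baseline would need to be refreshed whenever an intentional performance change is merged, and the kernel-name check could be downgraded to a warning while accuracy and stability take priority.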