Description
We've experienced a compiler regression when compiling our kernels for GB100 with CUDA 12.8: the affected kernels are compiled to use far more registers than usual. As a result, some kernels failed to launch in our tests, causing test failures.
For example, when compiling with sm_75:
ptxas info : Compiling entry function '_ZN16spark_rapids_jni6detail12copy_to_rowsINS0_30fixed_width_row_offset_functorEEEviiiN4cudf11device_spanIKNS0_9tile_infoELm18446744073709551615EEEPPKaPKiSC_T_SC_PPa' for 'sm_75'
ptxas info : Function properties for _ZN16spark_rapids_jni6detail12copy_to_rowsINS0_30fixed_width_row_offset_functorEEEviiiN4cudf11device_spanIKNS0_9tile_infoELm18446744073709551615EEEPPKaPKiSC_T_SC_PPa
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 34 registers, 16 bytes smem, 432 bytes cmem[0], 12 bytes cmem[2]
vs compiling with sm_100:
ptxas info : Compiling entry function '_ZN16spark_rapids_jni6detail12copy_to_rowsINS0_30fixed_width_row_offset_functorEEEviiiN4cudf11device_spanIKNS0_9tile_infoELm18446744073709551615EEEPPKaPKiSC_T_SC_PPa' for 'sm_100'
ptxas info : Function properties for _ZN16spark_rapids_jni6detail12copy_to_rowsINS0_30fixed_width_row_offset_functorEEEviiiN4cudf11device_spanIKNS0_9tile_infoELm18446744073709551615EEEPPKaPKiSC_T_SC_PPa
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 72 registers, used 1 barriers, 16 bytes smem
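(For reference, per-kernel resource usage like the above can be printed by passing the ptxas verbose flag through nvcc; the source file name below is only a placeholder, not the actual build invocation.)

```
nvcc -arch=sm_100 -Xptxas -v -c row_conversion.cu    # file name is hypothetical
```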
We should proactively prevent such situations from recurring by adding __launch_bounds__
to all our kernels. The annotation gives the compiler a hint for allocating resources (registers, cmem, etc.), in particular telling it not to use too many registers.
With that hint, the kernels should request a more predictable amount of resources (registers) on any GPU and launch successfully.
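A minimal sketch of what the annotation could look like, assuming a hypothetical kernel and a 256-thread block size (the real copy_to_rows kernel, its signature, and its launch configuration may differ):

```cuda
#include <cuda_runtime.h>

// Block size we promise to the compiler and use at launch time (assumed value).
constexpr int BLOCK_SIZE = 256;
// Optional second argument: minimum resident blocks per SM to aim for (assumed value).
constexpr int MIN_BLOCKS_PER_SM = 2;

// __launch_bounds__ tells ptxas this kernel is never launched with more than
// BLOCK_SIZE threads per block, so it must keep per-thread register usage low
// enough for that configuration (and for MIN_BLOCKS_PER_SM blocks per SM).
__global__ void __launch_bounds__(BLOCK_SIZE, MIN_BLOCKS_PER_SM)
copy_to_rows_sketch(int const* __restrict__ in, int* __restrict__ out, int n)
{
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx < n) { out[idx] = in[idx]; }
}

int main()
{
  int const n = 1 << 20;
  int *d_in = nullptr, *d_out = nullptr;
  cudaMalloc(&d_in, n * sizeof(int));
  cudaMalloc(&d_out, n * sizeof(int));
  // Launch with a block size no larger than the one promised to the compiler.
  copy_to_rows_sketch<<<(n + BLOCK_SIZE - 1) / BLOCK_SIZE, BLOCK_SIZE>>>(d_in, d_out, n);
  cudaDeviceSynchronize();
  cudaFree(d_in);
  cudaFree(d_out);
  return 0;
}
```

The second argument (minimum blocks per SM) is optional; even the single-argument form caps per-thread register use so that a block of the stated size can always be launched.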