Description
We've experienced a compiler regression when compiling our kernels for GB100 with CUDA 12.8: the affected kernels are compiled to use far more registers than usual. As a result, some kernels failed to launch in our tests, causing test failures.
For example, when compiling with sm_75:
ptxas info : Compiling entry function '_ZN16spark_rapids_jni6detail12copy_to_rowsINS0_30fixed_width_row_offset_functorEEEviiiN4cudf11device_spanIKNS0_9tile_infoELm18446744073709551615EEEPPKaPKiSC_T_SC_PPa' for 'sm_75'
ptxas info : Function properties for _ZN16spark_rapids_jni6detail12copy_to_rowsINS0_30fixed_width_row_offset_functorEEEviiiN4cudf11device_spanIKNS0_9tile_infoELm18446744073709551615EEEPPKaPKiSC_T_SC_PPa
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 34 registers, 16 bytes smem, 432 bytes cmem[0], 12 bytes cmem[2]
vs compiling with sm_100:
ptxas info : Compiling entry function '_ZN16spark_rapids_jni6detail12copy_to_rowsINS0_30fixed_width_row_offset_functorEEEviiiN4cudf11device_spanIKNS0_9tile_infoELm18446744073709551615EEEPPKaPKiSC_T_SC_PPa' for 'sm_100'
ptxas info : Function properties for _ZN16spark_rapids_jni6detail12copy_to_rowsINS0_30fixed_width_row_offset_functorEEEviiiN4cudf11device_spanIKNS0_9tile_infoELm18446744073709551615EEEPPKaPKiSC_T_SC_PPa
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 72 registers, used 1 barriers, 16 bytes smem
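(For reference, per-kernel resource usage like the above can be printed by passing the ptxas verbose flag through nvcc; the source file name below is only a placeholder, not the actual build invocation.)

```
nvcc -arch=sm_100 -Xptxas -v -c row_conversion.cu    # file name is hypothetical
```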
We should proactively prevent such situations from recurring by adding __launch_bounds__
to all our kernels. The annotation gives the compiler a hint for allocating resources (registers, cmem, etc.), in particular telling it not to use too many registers.
With that hint, the kernels should request a more predictable amount of resources (registers) on any GPU and launch successfully.
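A minimal sketch of what the annotation could look like, assuming a hypothetical kernel and a 256-thread block size (the real copy_to_rows kernel, its signature, and its launch configuration may differ):

```cuda
#include <cuda_runtime.h>

// Block size we promise to the compiler and use at launch time (assumed value).
constexpr int BLOCK_SIZE = 256;
// Optional second argument: minimum resident blocks per SM to aim for (assumed value).
constexpr int MIN_BLOCKS_PER_SM = 2;

// __launch_bounds__ tells ptxas this kernel is never launched with more than
// BLOCK_SIZE threads per block, so it must keep per-thread register usage low
// enough for that configuration (and for MIN_BLOCKS_PER_SM blocks per SM).
__global__ void __launch_bounds__(BLOCK_SIZE, MIN_BLOCKS_PER_SM)
copy_to_rows_sketch(int const* __restrict__ in, int* __restrict__ out, int n)
{
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx < n) { out[idx] = in[idx]; }
}

int main()
{
  int const n = 1 << 20;
  int *d_in = nullptr, *d_out = nullptr;
  cudaMalloc(&d_in, n * sizeof(int));
  cudaMalloc(&d_out, n * sizeof(int));
  // Launch with a block size no larger than the one promised to the compiler.
  copy_to_rows_sketch<<<(n + BLOCK_SIZE - 1) / BLOCK_SIZE, BLOCK_SIZE>>>(d_in, d_out, n);
  cudaDeviceSynchronize();
  cudaFree(d_in);
  cudaFree(d_out);
  return 0;
}
```

The second argument (minimum blocks per SM) is optional; even the single-argument form caps per-thread register use so that a block of the stated size can always be launched.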