Is this a duplicate?
Area
CUB
Is your feature request related to a problem? Please describe.
Clusters come at a cost that's justified for large segments. For segments under 4k elements, launching multiple CTAs is pure overhead.
Describe the solution you'd like
We should find a way to improve performance for small segments (512 - 4096 elements). A couple of ideas to investigate:
- launch two kernels:
- one CTA per segment that exits if segment is too large
- N CTAs per segment that exits or work-steals if segment is too small
- always launch N CTAs but exit in every CTA but the first one for small segments
This issue can be closed by a PR that brings cluster-level algorithm to ~10% perf window window of existing air TopK.
Describe alternatives you've considered
No response
Additional context
No response
Is this a duplicate?
Area
CUB
Is your feature request related to a problem? Please describe.
Clusters come at a cost that's justified for large segments. For segments under 4k elements, launching multiple CTAs is pure overhead.
Describe the solution you'd like
We should find a way to improve performance for small segments (512 - 4096 elements). A couple of ideas to investigate:
This issue can be closed by a PR that brings cluster-level algorithm to ~10% perf window window of existing air TopK.
Describe alternatives you've considered
No response
Additional context
No response