Skip to content

[FEA]: Optimize cluster TopK algorithm for small segment sizes #9548

Description

@gevtushenko

Is this a duplicate?

Area

CUB

Is your feature request related to a problem? Please describe.

Clusters come at a cost that's justified for large segments. For segments under 4k elements, launching multiple CTAs is pure overhead.

Describe the solution you'd like

We should find a way to improve performance for small segments (512 - 4096 elements). A couple of ideas to investigate:

  • launch two kernels:
    • one CTA per segment that exits if segment is too large
    • N CTAs per segment that exits or work-steals if segment is too small
  • always launch N CTAs but exit in every CTA but the first one for small segments

This issue can be closed by a PR that brings cluster-level algorithm to ~10% perf window window of existing air TopK.

Describe alternatives you've considered

No response

Additional context

No response

Metadata

Metadata

Assignees

Labels

No labels
No labels

Type

No type
No fields configured for issues without a type.

Projects

Status
Todo

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions