
[Performance Optimization] Rewrite GPU TopK kernel with radix-select and multi-tier sorting#78409

Merged
zhengshengning merged 17 commits into PaddlePaddle:develop from zhengshengning:acc_opt_topk
Apr 17, 2026

Conversation

Contributor

@zhengshengning zhengshengning commented Mar 20, 2026

PR Category

Performance Optimization

PR Types

Performance

Description

This PR rewrites the GPU TopK operator with a high-performance radix-select based implementation, replacing the previous one to improve performance on large-scale data.

Note: TopK indices are now aligned with Torch, using the sorting algorithms below.

Main changes

  • Added top_k_cuda_kernel.cu: an all-new GPU TopK implementation built on the following core algorithms:

    • Radix-Select: locates the k-th largest/smallest element by bit-wise radix selection, avoiding a full sort
    • Multi-Block TopK (mbtopk): multiple blocks cooperate on the selection for large slices
    • Single-Block TopK (sbtopk): an optimized path for small slices
    • Multi-tier sorting strategy:
      • k ≤ 32: Bitonic Sort
      • k ≤ 128: CUB WarpMergeSort
      • k ≤ 4096: CUB BlockRadixSort
      • k > 4096: fall back to ArgsortKernel + TakeAlongAxisKernel
  • Modified top_k_kernel.cu: renamed the original TopkKernel to TopkKernelOld and registered it as topk_old, kept for comparison
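The tier thresholds listed above can be sketched as a plain dispatch function (a host-side illustration; the function and tier names here are hypothetical, not Paddle's actual kernel launch code):

```cpp
#include <string>

// Illustrative sketch of the sort-tier dispatch described in this PR.
// Thresholds mirror the PR description; the real kernel dispatches to
// device-side sorts (bitonic network, cub::WarpMergeSort,
// cub::BlockRadixSort) or falls back to a full argsort.
std::string PickSortTier(int k) {
  if (k <= 32) return "BitonicSort";          // warp-level bitonic network
  if (k <= 128) return "CubWarpMergeSort";    // CUB warp-scope merge sort
  if (k <= 4096) return "CubBlockRadixSort";  // CUB block-scope radix sort
  return "ArgsortFallback";  // ArgsortKernel + TakeAlongAxisKernel
}
```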

Performance gains

The new implementation uses radix-select to pick the top-k elements without a full sort, giving significant speedups over the previous implementation across a range of k values and data sizes.
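As a rough illustration of the radix-select idea (a host-side sketch on unsigned integers, not the kernel code): bucket values by one digit at a time from the most significant digit down, keep only the bucket that contains the k-th element, and repeat, so the k-th value is found without sorting everything. The real kernel applies the same idea to bit-converted floats across many threads.

```cpp
#include <cstdint>
#include <vector>

// Host-side sketch of radix-select: find the k-th largest of unsigned
// 32-bit values by scanning 8-bit digits from most- to least-significant.
// `prefix` accumulates the digits of the answer found so far.
uint32_t RadixSelectKthLargest(const std::vector<uint32_t>& data, int k) {
  uint32_t prefix = 0, prefix_mask = 0;
  for (int shift = 24; shift >= 0; shift -= 8) {
    int count[256] = {0};
    for (uint32_t v : data) {
      if ((v & prefix_mask) == prefix) ++count[(v >> shift) & 0xFF];
    }
    // Walk buckets from the largest digit down until k is covered.
    for (int d = 255; d >= 0; --d) {
      if (k <= count[d]) {
        prefix |= static_cast<uint32_t>(d) << shift;
        prefix_mask |= 0xFFu << shift;
        break;
      }
      k -= count[d];
    }
  }
  return prefix;
}
```

Each pass only counts; no element ever moves, which is what makes the approach cheaper than a full sort for large inputs.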
H800: (benchmark chart omitted)

A100: (benchmark chart omitted)

Does this introduce precision changes?

Replace the existing GPU TopK implementation with a new radix-select
based algorithm and multi-tier sorting strategy for improved performance:

- Radix-select for efficient top-k selection
- Multi-block top-k (mbtopk) for large slices
- Single-block top-k (sbtopk) for smaller slices
- Three-tier sort dispatch: Bitonic Sort (k<=32), WarpMergeSort (k<=128),
  BlockRadixSort (k<=4096), ArgsortKernel fallback (k>4096)
- Rename old TopkKernel to TopkKernelOld for reference
paddle-bot Bot commented Mar 20, 2026

Your PR has been submitted. Thanks for your contribution!
Please wait for the CI results first. See the Paddle CI Manual for details.

Fix doLdg duplicate definition: restore long long types with NOLINT

On LP64 Linux, int64_t is a typedef of long, not long long. Using int64_t
caused a duplicate specialization. Restore the original long long / unsigned
long long types with NOLINT to suppress cpplint, and remove the
duplicate int64_t specialization.
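The LP64 point can be checked in isolation (a standalone snippet, independent of the Paddle sources): `int64_t` aliases exactly one of `long` and `long long`, and which one is platform-dependent, so writing specializations for both `int64_t` and the fundamental type it happens to alias defines the same specialization twice.

```cpp
#include <cstdint>
#include <type_traits>

// On LP64 Linux int64_t is long; on LLP64 Windows it is long long.
// On mainstream platforms it aliases exactly one of the two, so a
// template specialized for that fundamental type plus int64_t is a
// duplicate definition.
constexpr bool kInt64IsLong = std::is_same<int64_t, long>::value;           // NOLINT
constexpr bool kInt64IsLongLong = std::is_same<int64_t, long long>::value;  // NOLINT

static_assert(kInt64IsLong != kInt64IsLongLong,
              "int64_t aliases exactly one 64-bit fundamental type");
```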
Fix TopkKernel crash: defer Alloc until after FromTensor resize

When k comes from a tensor, InferMeta may set output dims with -1,
making the metadata invalid. Calling Alloc before resolving the actual k
value triggers PreconditionNotMetError.

Fix: move Alloc after the FromTensor() resize, and add an empty-output guard
and empty-input handling to match the old kernel's behavior.
Fix HIP/ROCm compilation errors in top_k_cuda_kernel.cu

- Bitfield: add a HIP fallback using bit shifts instead of PTX asm
  (bfe.u32/u64 and bfi.b32/b64 are NVIDIA PTX only)
- getLaneId/getLaneMaskLe/getLaneMaskLt: use HIP intrinsics under __HIPCC__
- CubKeyType<bfloat16>: use hip_bfloat16 instead of __nv_bfloat16
- Replace cudaStream_t with gpuStream_t (Paddle's unified type alias)
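The shift-based fallback for the PTX bitfield instructions can be sketched as follows (a portable host-side version; the kernel wraps the same logic in __device__ functions, and the names here are illustrative):

```cpp
#include <cstdint>

// Portable stand-ins for PTX bfe.u32 (bitfield extract) and
// bfi.b32 (bitfield insert), for 0 <= pos and pos + len <= 32.
uint32_t GetBitfield(uint32_t val, int pos, int len) {
  if (len >= 32) return val >> pos;  // avoid UB in 1u << 32
  return (val >> pos) & ((1u << len) - 1u);
}

uint32_t SetBitfield(uint32_t dst, uint32_t src, int pos, int len) {
  uint32_t mask = (len >= 32) ? ~0u : ((1u << len) - 1u);
  mask <<= pos;
  return (dst & ~mask) | ((src << pos) & mask);
}
```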
Fix Windows build: bring gpuStream_t into the anonymous namespace

gpuStream_t is defined in the phi:: namespace (via gpu_decls.h). The helper
functions in the anonymous namespace cannot access it without
qualification. Add 'using phi::gpuStream_t;' at the top of the
anonymous namespace.
Fix DCU/HIP compilation errors in top_k_cuda_kernel.cu

- Guard __syncwarp() with #if !defined(__HIPCC__), since HIP/DCU
  does not provide this intrinsic (AMD wavefronts run in lockstep)
- Replace cudaMemsetAsync with hipMemsetAsync under PADDLE_WITH_HIP
- Use conservative defaults for regsPerMultiprocessor (65536) and
  maxBlocksPerMultiProcessor on HIP, since hipDeviceProp_t lacks
  these members
Contributor

@wanghuancoder wanghuancoder left a comment


LGTM

Contributor

@From00 From00 left a comment


LGTM

Contributor

@lugimzzz lugimzzz left a comment


lgtm

@zhengshengning zhengshengning merged commit d8f60c6 into PaddlePaddle:develop Apr 17, 2026
144 of 151 checks passed
sneaxiy pushed a commit that referenced this pull request Apr 17, 2026
…radix-select and multi-tier sorting #78409 (#78659)

* [TopK] Rewrite GPU TopK kernel with radix-select and multi-tier sorting

Replace the existing GPU TopK implementation with a new radix-select
based algorithm and multi-tier sorting strategy for improved performance:

- Radix-select for efficient top-k selection
- Multi-block top-k (mbtopk) for large slices
- Single-block top-k (sbtopk) for smaller slices
- Three-tier sort dispatch: Bitonic Sort (k<=32), WarpMergeSort (k<=128),
  BlockRadixSort (k<=4096), ArgsortKernel fallback (k>4096)
- Rename old TopkKernel to TopkKernelOld for reference

* Fix doLdg duplicate definition: restore long long types with NOLINT

On LP64 Linux, int64_t is typedef of long, not long long. Using int64_t
caused duplicate specialization. Restore original long long / unsigned
long long types with NOLINT to suppress cpplint, and remove the
duplicate int64_t specialization.

* Fix TopkKernel crash: defer Alloc until after FromTensor resize

When k comes from a tensor, InferMeta may set output dims with -1,
making metadata invalid. Calling Alloc before resolving the actual k
value triggers PreconditionNotMetError.

Fix: move Alloc after FromTensor() resize, add empty-output guard and
empty-input handling to match the old kernel behavior.

* Fix HIP/ROCm compilation errors in top_k_cuda_kernel.cu

- Bitfield: add HIP fallback using bit shifts instead of PTX asm
  (bfe.u32/u64, bfi.b32/b64 are NVIDIA PTX only)
- getLaneId/getLaneMaskLe/getLaneMaskLt: use HIP intrinsics on __HIPCC__
- CubKeyType<bfloat16>: use hip_bfloat16 instead of __nv_bfloat16
- Replace cudaStream_t with gpuStream_t (Paddle's unified type alias)

* Fix Windows build: bring gpuStream_t into anonymous namespace

gpuStream_t is defined in phi:: namespace (via gpu_decls.h). The helper
functions in the anonymous namespace cannot access it without
qualification. Add 'using phi::gpuStream_t;' at the top of the
anonymous namespace.

* Fix DCU/HIP compilation errors in top_k_cuda_kernel.cu

- Guard __syncwarp() with #if !defined(__HIPCC__) since HIP/DCU
  does not provide this intrinsic (AMD wavefronts are lockstep)
- Replace cudaMemsetAsync with hipMemsetAsync under PADDLE_WITH_HIP
- Use conservative defaults for regsPerMultiprocessor (65536) and
  maxBlocksPerMultiProcessor on HIP since hipDeviceProp_t lacks
  these members

* rename tok_cuda_kernel

* fix

* fix

* fix2

* fix

* fix2

* fix
zhengshengning added a commit to zhengshengning/Paddle that referenced this pull request Apr 17, 2026
…and multi-tier sorting (PaddlePaddle#78409)

(Same commit list as above.)
sneaxiy pushed a commit that referenced this pull request Apr 21, 2026
…and multi-tier sorting (#78409) (#78703)

(Same commit list as above.)


5 participants