fix: use ThrustAllocator in argsort 1D path to avoid implicit cudaStr…#78726

Merged
A-nnonymous merged 2 commits into PaddlePaddle:develop from
A-nnonymous:fix/argsort_thrust_allocator
Apr 20, 2026
Conversation

@A-nnonymous
Contributor

PR Category

Operator Mechanism

PR Types

Performance

Description

Fix the inefficient malloc path of the argsort operator in the 1D case so that it goes through the framework's memory management mechanism, avoiding the extra synchronization caused by raw malloc.

pcard-91067

Does this introduce any precision change?

fix: use ThrustAllocator in argsort 1D path to avoid implicit cudaStreamSynchronize

The 1D argsort path uses thrust::sort_by_key / thrust::stable_sort_by_key
with the default execution policy (thrust::cuda::par.on(stream)), which
causes thrust to allocate temporary workspace via cudaMalloc/cudaFree.
These are synchronous CUDA API calls that implicitly trigger
cudaStreamSynchronize, draining all pending GPU work on the stream.

This creates a false data dependency: if any prior kernels (e.g. backward
weight gradient GEMMs) are still executing on the same stream, argsort
blocks until they complete — adding ~2ms of unnecessary stall per call.

The fix passes phi::memory_utils::ThrustAllocator to the thrust execution
policy, routing temporary allocations through Paddle's caching allocator
(which is async and does not synchronize). This is consistent with other
Paddle kernels that use thrust (e.g. unique_kernel.cu, shuffle_batch_kernel.cu).
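The mechanism can be sketched as follows. This is a minimal standalone CUDA sketch, not the actual Paddle patch: `CachingAllocator` is a hypothetical stand-in for `phi::memory_utils::ThrustAllocator` (whose exact constructor signature is not shown in this PR), and `cudaMallocAsync` stands in for Paddle's caching allocator as a simple async backing store.

```cuda
#include <thrust/device_ptr.h>
#include <thrust/execution_policy.h>
#include <thrust/sort.h>

// Hypothetical stand-in for phi::memory_utils::ThrustAllocator. Thrust calls
// allocate()/deallocate() for its temporary workspace, so routing them through
// a stream-ordered (or caching) pool avoids the blocking cudaMalloc/cudaFree
// that the default execution policy issues.
struct CachingAllocator {
  using value_type = char;
  explicit CachingAllocator(cudaStream_t s) : stream_(s) {}
  char* allocate(std::ptrdiff_t num_bytes) {
    char* p = nullptr;
    // A real implementation would hit a caching pool; cudaMallocAsync is used
    // here only as a minimal asynchronous example.
    cudaMallocAsync(&p, num_bytes, stream_);
    return p;
  }
  void deallocate(char* p, size_t) { cudaFreeAsync(p, stream_); }
  cudaStream_t stream_;
};

void ArgsortSort1D(int* keys, int64_t* vals, int n, cudaStream_t stream) {
  // Before: thrust::cuda::par.on(stream) — temporaries come from
  // cudaMalloc/cudaFree, which synchronize the device.
  // After: pass the allocator into the policy so temporaries are allocated
  // asynchronously on the same stream.
  CachingAllocator alloc(stream);
  thrust::sort_by_key(thrust::cuda::par(alloc).on(stream),
                      thrust::device_pointer_cast(keys),
                      thrust::device_pointer_cast(keys + n),
                      thrust::device_pointer_cast(vals));
}
```

The key change is `thrust::cuda::par(alloc).on(stream)` in place of `thrust::cuda::par.on(stream)`: Thrust's documented custom-temporary-allocation hook, which keeps the sort itself unchanged while removing the implicit synchronization.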

nsys evidence (500K int32 argsort after a 4096x4096 matmul):
  Before: 3x cudaStreamSynchronize + 1x cudaMalloc + 1x cudaFree per call
          wall time = 2.5ms (steady state)
  After:  0x cudaStreamSynchronize, 0x cudaMalloc (caching allocator hit)
          wall time = 0.07ms (expected, matching the 2D CUB path)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@paddle-bot

paddle-bot Bot commented Apr 20, 2026

Your PR has been submitted. Thanks for your contribution!
Please wait for the CI results first. See the Paddle CI Manual for details.

ForFishes
ForFishes previously approved these changes Apr 20, 2026
Member

@ForFishes left a comment

LGTM

Contributor

@wanghuancoder left a comment

LGTM

A-nnonymous merged commit 6b26757 into PaddlePaddle:develop Apr 20, 2026
90 of 93 checks passed
risemeup1111 pushed a commit to risemeup1111/Paddle that referenced this pull request Apr 20, 2026
fix: use ThrustAllocator in argsort 1D path to avoid implicit cudaStreamSynchronize (PaddlePaddle#78726)

* fix: use ThrustAllocator in argsort 1D path to avoid implicit cudaStreamSynchronize

* polish

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
@risemeup1111

✅ Cherry-pick successful! Created PR: #78733

4 participants