fix: use ThrustAllocator in argsort 1D path to avoid implicit cudaStr…#78726

Merged
A-nnonymous merged 2 commits into PaddlePaddle:develop from
A-nnonymous:fix/argsort_thrust_allocator
Apr 20, 2026
Conversation

@A-nnonymous
Contributor

PR Category

Operator Mechanism

PR Types

Performance

Description

Fix the inefficient malloc path of the argsort operator in the 1D case so that it goes through the framework's memory management mechanism, avoiding the extra synchronization caused by raw malloc.

pcard-91067

Does this introduce any precision change?

fix: use ThrustAllocator in argsort 1D path to avoid implicit cudaStreamSynchronize

The 1D argsort path uses thrust::sort_by_key / thrust::stable_sort_by_key
with the default execution policy (thrust::cuda::par.on(stream)), which
causes thrust to allocate temporary workspace via cudaMalloc/cudaFree.
These are synchronous CUDA API calls that implicitly trigger
cudaStreamSynchronize, draining all pending GPU work on the stream.

This creates a false data dependency: if any prior kernels (e.g. backward
weight gradient GEMMs) are still executing on the same stream, argsort
blocks until they complete — adding ~2ms of unnecessary stall per call.

The fix passes phi::memory_utils::ThrustAllocator to the thrust execution
policy, routing temporary allocations through Paddle's caching allocator
(which is async and does not synchronize). This is consistent with other
Paddle kernels that use thrust (e.g. unique_kernel.cu, shuffle_batch_kernel.cu).
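The mechanism can be sketched as follows. This is a minimal standalone CUDA sketch, not the actual Paddle patch: `CachingAllocator` is a hypothetical stand-in for `phi::memory_utils::ThrustAllocator` (whose exact constructor signature is not shown in this PR), and `cudaMallocAsync` stands in for Paddle's caching allocator as a simple async backing store.

```cuda
#include <thrust/device_ptr.h>
#include <thrust/execution_policy.h>
#include <thrust/sort.h>

// Hypothetical stand-in for phi::memory_utils::ThrustAllocator. Thrust calls
// allocate()/deallocate() for its temporary workspace, so routing them through
// a stream-ordered (or caching) pool avoids the blocking cudaMalloc/cudaFree
// that the default execution policy issues.
struct CachingAllocator {
  using value_type = char;
  explicit CachingAllocator(cudaStream_t s) : stream_(s) {}
  char* allocate(std::ptrdiff_t num_bytes) {
    char* p = nullptr;
    // A real implementation would hit a caching pool; cudaMallocAsync is used
    // here only as a minimal asynchronous example.
    cudaMallocAsync(&p, num_bytes, stream_);
    return p;
  }
  void deallocate(char* p, size_t) { cudaFreeAsync(p, stream_); }
  cudaStream_t stream_;
};

void ArgsortSort1D(int* keys, int64_t* vals, int n, cudaStream_t stream) {
  // Before: thrust::cuda::par.on(stream) — temporaries come from
  // cudaMalloc/cudaFree, which synchronize the device.
  // After: pass the allocator into the policy so temporaries are allocated
  // asynchronously on the same stream.
  CachingAllocator alloc(stream);
  thrust::sort_by_key(thrust::cuda::par(alloc).on(stream),
                      thrust::device_pointer_cast(keys),
                      thrust::device_pointer_cast(keys + n),
                      thrust::device_pointer_cast(vals));
}
```

The key change is `thrust::cuda::par(alloc).on(stream)` in place of `thrust::cuda::par.on(stream)`: Thrust's documented custom-temporary-allocation hook, which keeps the sort itself unchanged while removing the implicit synchronization.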

nsys evidence (500K int32 argsort after a 4096x4096 matmul):
  Before: 3x cudaStreamSynchronize + 1x cudaMalloc + 1x cudaFree per call
          wall time = 2.5ms (steady state)
  After:  0x cudaStreamSynchronize, 0x cudaMalloc (caching allocator hit)
          wall time = 0.07ms (expected, matching the 2D CUB path)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@paddle-bot

paddle-bot Bot commented Apr 20, 2026

Your PR has been submitted. Thanks for your contribution!
Please wait for the CI results first. See the Paddle CI Manual for details.

ForFishes
ForFishes previously approved these changes Apr 20, 2026
Member

@ForFishes left a comment

LGTM

Contributor

@wanghuancoder left a comment

LGTM

A-nnonymous merged commit 6b26757 into PaddlePaddle:develop Apr 20, 2026
90 of 93 checks passed
risemeup1111 pushed a commit to risemeup1111/Paddle that referenced this pull request Apr 20, 2026
fix: use ThrustAllocator in argsort 1D path to avoid implicit cudaStreamSynchronize (PaddlePaddle#78726)

* fix: use ThrustAllocator in argsort 1D path to avoid implicit cudaStreamSynchronize

* polish

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
@risemeup1111

✅ Cherry-pick successful! Created PR: #78733

4 participants