[release/3.4] fix: use ThrustAllocator in argsort 1D path to avoid implicit cudaStr…#78733
Open
risemeup1111 wants to merge 1 commit into PaddlePaddle:release/3.4 from
PaddlePaddle#78726)

* fix: use ThrustAllocator in argsort 1D path to avoid implicit cudaStreamSynchronize

The 1D argsort path uses thrust::sort_by_key / thrust::stable_sort_by_key with the default execution policy (thrust::cuda::par.on(stream)), which causes thrust to allocate temporary workspace via cudaMalloc/cudaFree. These are synchronous CUDA API calls that implicitly trigger cudaStreamSynchronize, draining all pending GPU work on the stream.

This creates a false data dependency: if any prior kernels (e.g. backward weight gradient GEMMs) are still executing on the same stream, argsort blocks until they complete, adding ~2ms of unnecessary stall per call.

The fix passes phi::memory_utils::ThrustAllocator to the thrust execution policy, routing temporary allocations through Paddle's caching allocator (which is async and does not synchronize). This is consistent with other Paddle kernels that use thrust (e.g. unique_kernel.cu, shuffle_batch_kernel.cu).

nsys evidence (500K int32 argsort after a 4096x4096 matmul):

Before: 3x cudaStreamSynchronize + 1x cudaMalloc + 1x cudaFree per call
        wall time = 2.5ms (steady state)
After:  0x cudaStreamSynchronize, 0x cudaMalloc (caching allocator hit)
        wall time = 0.07ms (expected, matching the 2D CUB path)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* polish

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
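For readers unfamiliar with the allocator hook, the idea behind the fix can be sketched in host-only C++. This is an illustration under stated assumptions, not Paddle's phi::memory_utils::ThrustAllocator: the class name `CachingScratchAllocator`, the helper `CountRawAllocs`, and the use of `std::malloc` as a stand-in for `cudaMalloc` are all hypothetical. What it does mirror is the minimal interface Thrust expects from a custom temporary allocator (`value_type`, `allocate`, `deallocate`) and the reason a cache removes the per-call raw allocation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <map>
#include <new>

// Hypothetical host-side stand-in for a framework caching allocator.
// It exposes the shape Thrust expects from a temporary-storage allocator:
// a value_type plus allocate(n) / deallocate(p, n).
class CachingScratchAllocator {
 public:
  using value_type = char;

  CachingScratchAllocator() = default;
  CachingScratchAllocator(const CachingScratchAllocator&) = delete;
  CachingScratchAllocator& operator=(const CachingScratchAllocator&) = delete;

  ~CachingScratchAllocator() {
    // Release cached blocks and any block the caller never returned.
    for (auto& kv : free_blocks_) std::free(kv.second);
    for (auto& kv : live_blocks_) std::free(kv.first);
  }

  char* allocate(std::ptrdiff_t num_bytes) {
    assert(num_bytes >= 0);
    const std::size_t size = num_bytes > 0 ? static_cast<std::size_t>(num_bytes) : 1;
    // Reuse the smallest cached block that is large enough.
    auto it = free_blocks_.lower_bound(size);
    if (it != free_blocks_.end()) {
      char* p = it->second;
      live_blocks_[p] = it->first;
      free_blocks_.erase(it);
      ++cache_hits;
      return p;
    }
    // Cache miss: std::malloc stands in for the raw (synchronizing) allocation.
    char* p = static_cast<char*>(std::malloc(size));
    if (p == nullptr) throw std::bad_alloc();
    live_blocks_[p] = size;
    ++raw_allocs;
    return p;
  }

  void deallocate(char* p, std::size_t /*num_bytes*/) {
    // Return the block to the cache instead of freeing it.
    auto it = live_blocks_.find(p);
    assert(it != live_blocks_.end());
    free_blocks_.emplace(it->second, p);
    live_blocks_.erase(it);
  }

  int raw_allocs = 0;  // number of times the raw allocator was called
  int cache_hits = 0;  // number of requests served from the cache

 private:
  std::multimap<std::size_t, char*> free_blocks_;  // block size -> cached block
  std::map<char*, std::size_t> live_blocks_;       // handed-out block -> block size
};

// Simulates `calls` back-to-back sorts that each request and release
// `scratch_bytes` of temporary workspace, and reports the raw allocations made.
int CountRawAllocs(int calls, std::ptrdiff_t scratch_bytes) {
  CachingScratchAllocator alloc;
  for (int i = 0; i < calls; ++i) {
    char* workspace = alloc.allocate(scratch_bytes);
    alloc.deallocate(workspace, static_cast<std::size_t>(scratch_bytes));
  }
  return alloc.raw_allocs;
}
```

In real CUDA code an allocator of this shape is typically handed to Thrust through the execution policy, e.g. `thrust::cuda::par(alloc).on(stream)`. A production device allocator also has to account for stream ordering when reusing blocks, which this sketch ignores.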
Your PR was submitted successfully. Thank you for contributing to this open-source project!
Contributor
/re-run all-failed
PR Category
Operator Mechanism
PR Types
Performance
Description
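Fix the inefficient malloc path of the argsort operator in the 1D case so that it goes through the framework's memory management mechanism, avoiding the extra synchronization caused by raw malloc.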
pcard-91067
Does this change affect precision?
No
Cherry-pick of #78726 (authored by @A-nnonymous) to release/3.4.
devPR: #78726