Fix the inplace fill semantics of compat pinned tensors #78666
SigureMo wants to merge 3 commits into PaddlePaddle:develop
Conversation
Return mapped device-visible pointers for compat CUDA-pinned tensors and add a regression test that exercises direct kernel writes into pin_memory tensors. Co-authored-by: Codex <codex@openai.com>
Your PR has been submitted successfully. Thank you for contributing to this open-source project!
Pull request overview
This PR fixes compat pinned-memory tensors so that data_ptr() can be passed directly into CUDA kernels by allocating mapped pinned storage in the compat path and returning a device-visible mapped alias for CUDA-pinned tensors. It also adds a CUDA regression test to validate kernel writes into a compat pinned tensor.
Changes:
- Add compat utilities to allocate mapped pinned host memory and to resolve CUDA-pinned tensors to a kernel-visible pointer.
- Route compat pinned tensor creation/copies through the mapped pinned allocation path.
- Update compat `data_ptr()`/`(const_)data_ptr<T>()` to use the kernel-visible pointer and add a CUDA kernel regression test.
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 1 comment.
Summary per file:
| File | Description |
|---|---|
| test/cpp/compat/CMakeLists.txt | Registers the new CUDA regression test in compat test CMake. |
| test/cpp/compat/ATen_pin_memory_kernel_test.cu | Adds a CUDA kernel test that writes into a pinned (host) tensor via data_ptr(). |
| paddle/phi/api/include/compat/utils/mapped_pinned_tensor.h | Introduces mapped pinned allocation helpers and kernel-visible pointer resolution for CUDA/HIP pinned tensors. |
| paddle/phi/api/include/compat/ATen/ops/new_empty.h | Uses mapped pinned allocation when pin_memory=true on new_empty. |
| paddle/phi/api/include/compat/ATen/ops/empty.h | Uses mapped pinned allocation when pin_memory=true on empty. |
| paddle/phi/api/include/compat/ATen/ops/empty_like.h | Uses mapped pinned allocation/copy helper when creating pinned empty_like. |
| paddle/phi/api/include/compat/ATen/core/TensorMethods.cpp | Redirects typed data_ptr/const_data_ptr to the (now kernel-visible) data_ptr() implementation. |
| paddle/phi/api/include/compat/ATen/core/TensorBody.h | Updates Tensor::{data_ptr,const_data_ptr,mutable_data_ptr} and pin_memory() to use mapped pinned helpers / kernel-visible pointer. |
| paddle/phi/api/include/compat/ATen/core/TensorBase.h | Updates TensorBase::data_ptr() to return the kernel-visible pointer and adjusts its documentation accordingly. |
```cpp
inline void* _PD_GetKernelVisibleDataPtr(const paddle::Tensor& tensor) {
  if (!tensor.defined()) {
    return nullptr;
  }

#if defined(PADDLE_WITH_CUDA) || defined(PADDLE_WITH_HIP)
  if (phi::is_cuda_pinned_place(tensor.place())) {
    auto dense = std::dynamic_pointer_cast<phi::DenseTensor>(tensor.impl());
    if (!dense) {
      return const_cast<void*>(tensor.data());
    }

    auto holder = dense->Holder();
    if (!holder || holder->ptr() == nullptr) {
      return const_cast<void*>(tensor.data());
    }

    void* mapped_base = nullptr;
#ifdef PADDLE_WITH_HIP
    auto err = hipHostGetDevicePointer(&mapped_base, holder->ptr(), 0);
    if (err == hipSuccess && mapped_base != nullptr) {
      return static_cast<char*>(mapped_base) + dense->meta().offset;
    }
    (void)hipGetLastError();
#elif defined(PADDLE_WITH_CUDA)
    auto err = cudaHostGetDevicePointer(&mapped_base, holder->ptr(), 0);
    if (err == cudaSuccess && mapped_base != nullptr) {
      return static_cast<char*>(mapped_base) + dense->meta().offset;
    }
    (void)cudaGetLastError();
#endif
  }
#endif

  return const_cast<void*>(tensor.data());
}
```
_PD_GetKernelVisibleDataPtr() falls back to returning tensor.data() when cudaHostGetDevicePointer/hipHostGetDevicePointer fails. Since the default GPUPinned allocator uses cudaHostAllocPortable (no Mapped flag) (see paddle/phi/core/memory/allocation/pinned_allocator.cc:40-46), this failure path will be hit for pinned tensors created outside the new mapped allocation helpers, and data_ptr() will again return a non-device-visible host address (contradicting the new “pointer kernels should use” contract). Consider either (a) ensuring all compat pinned-tensor creation/copy paths use _PD_CreateMappedPinnedAllocation (or another mapped/registered strategy), or (b) making this function throw/explicitly error when it cannot obtain a device-visible alias for a CUDA-pinned tensor, to avoid silently returning an unsafe pointer for kernels.
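For context, here is a minimal standalone CUDA sketch (an illustration, not the PR's code) of the mapped-pinned strategy the helper relies on: allocating with `cudaHostAllocMapped` is what lets `cudaHostGetDevicePointer` hand back a device-visible alias that a kernel can write through, which is exactly the flag the reviewer notes is absent from the default Portable-only pinned allocator.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Writes through the device-visible alias of mapped pinned host memory.
__global__ void fill_ones(float* data, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] = 1.0f;
}

int main() {
  const int n = 256;
  float* host = nullptr;
  // cudaHostAllocMapped is the flag that makes cudaHostGetDevicePointer
  // succeed; a Portable-only allocation may not provide a mapped alias.
  cudaHostAlloc(reinterpret_cast<void**>(&host), n * sizeof(float),
                cudaHostAllocMapped | cudaHostAllocPortable);

  float* dev = nullptr;
  cudaHostGetDevicePointer(reinterpret_cast<void**>(&dev), host, 0);

  // The kernel's writes land directly in the pinned host buffer.
  fill_ones<<<(n + 127) / 128, 128>>>(dev, n);
  cudaDeviceSynchronize();

  printf("host[0] = %f\n", host[0]);
  cudaFreeHost(host);
  return 0;
}
```

Requires a CUDA-capable device at runtime; on platforms without unified addressing the mapped alias can differ from the host pointer, which is why the helper resolves it explicitly instead of reusing `holder->ptr()`.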
Replace the internal-only DenseTensor::memory_size() call in the compat mapped pinned helper with a header-visible byte count computation so downstream extensions that include compat headers keep compiling. Co-authored-by: Codex <codex@openai.com>
Codecov Report
❌ Patch coverage is
Additional details and impacted files

```
@@            Coverage Diff            @@
##           develop    #78666   +/-   ##
==========================================
  Coverage         ?    93.10%
==========================================
  Files            ?         4
  Lines            ?        29
  Branches         ?         0
==========================================
  Hits             ?        27
  Misses           ?         2
  Partials         ?         0
```

☔ View full report in Codecov by Sentry.
Co-authored-by: Codex <codex@openai.com>
Background

This PR fixes the inplace write semantics of `pin_memory` tensors on the compat path. In the previous implementation, operations such as `fill_`/`zero_` would drift a `gpu_pinned` tensor to `gpu:0` through the generic dispatch, making compat behavior inconsistent with torch: real torch keeps a pinned host tensor after running `fill_` on a `pin_memory` tensor.

Fix

The change converges at two layers:
- The `mapped_pinned_tensor` route: restore the native raw pinned allocation / `data_ptr()` semantics of compat pinned tensors.
- The pinned creation paths of `empty`/`new_empty`/`empty_like` are restored to CPU tensor + `copy_to(pinned_place)`.
- `fill_`/`zero_` gain a host-pinned special case: first build a source tensor of the same shape on CPU, then `copy_` it back into the pinned tensor, avoiding place drift.
- The torch proxy ultimately holds a `paddle.Tensor`, so the `fill_` in `quick_probe.py` goes through the Python `paddle.Tensor.fill_`, not the C++ `at::Tensor::fill_`. The `fill_`/`zero_` in `python/paddle/tensor/manipulation.py` gain the same host-pinned special case, so the Python compat path also aligns with torch.
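At the memory level, the host-pinned special case can be pictured with plain CUDA runtime calls (a hedged analogy, not the PR's code): the values are produced in ordinary host memory and then copied into the existing pinned buffer, so the pinned allocation is never replaced and the tensor's place cannot drift.

```cuda
#include <cstdio>
#include <cstring>
#include <cuda_runtime.h>

int main() {
  const int n = 8;
  float* pinned = nullptr;
  // Stands in for the compat pinned tensor's storage.
  cudaHostAlloc(reinterpret_cast<void**>(&pinned), n * sizeof(float),
                cudaHostAllocPortable);

  // Step 1: build a same-shape source on ordinary CPU memory
  // (the analogue of constructing the CPU source tensor for fill_(3.0)).
  float src[n];
  for (int i = 0; i < n; ++i) src[i] = 3.0f;

  // Step 2: copy it back into the pinned storage in place -- the pinned
  // buffer survives, unlike a dispatch that re-allocates on gpu:0.
  std::memcpy(pinned, src, n * sizeof(float));

  printf("pinned[0] = %f\n", pinned[0]);
  cudaFreeHost(pinned);
  return 0;
}
```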
Verification

Verified:
- `build/test/cpp/ATen_pin_memory_creation_test`
- `build/test/cpp/compat/ATen_pin_memory_kernel_test`
- `build/test/cpp/ATen_tensor_data_test`
- `PYTHONPATH=build/python python test/legacy_test/test_tensor_fill_.py`
- `PYTHONPATH=build/python python paddle_compat_repro/quick_probe.py`

After `fill_` in `quick_probe.py` the tensor now stays at `Place(gpu_pinned)`, and the `copy_` path also behaves correctly.

Notes
`numpy()` on pinned tensors still has copy-out semantics; that is not handled in this PR. This PR only converges the problem chain of `fill_`/`zero_`, pinned creation, and direct kernel pass-through.