Signed-off-by: sykwer <sykwer@gmail.com>
Design Principles
API Reference

Message Types

| Type | Base | `data` field |
|---|---|---|
| `agnocast::cuda::PointCloud2` | `sensor_msgs::msg::PointCloud2` + `cuda_message_tag` | `uint8_t*` (GPU device pointer, shadows base `std::vector`) |
| `agnocast::cuda::Image` | `sensor_msgs::msg::Image` + `cuda_message_tag` | `uint8_t*` (GPU device pointer, shadows base `std::vector`) |
CPU metadata (header, width, height, point_step, etc.) lives in shared memory. `data` points to GPU device memory allocated via `cudaMalloc`.
To define a custom CUDA message type, inherit from both the ROS message and `agnocast::cuda_message_tag`, shadow the `data` field, and specialize `get_cuda_gpu_data_size<T>()`:

```cpp
struct MyGpuMsg : public my_msgs::msg::MyMsg, public agnocast::cuda_message_tag {
  uint8_t* data = nullptr;
};

template <> size_t agnocast::get_cuda_gpu_data_size(const MyGpuMsg& msg) {
  return msg.width * msg.height * msg.bytes_per_pixel;
}
```

User packages must link against `agnocast_cuda` to resolve the `get_backend()` symbol (declared in agnocastlib, defined in agnocast_cuda). Note that `find_package(agnocast_cuda)` always succeeds (even without CUDA) because the library has no build-time CUDA dependency. Packages that compile `.cu` files or call CUDA APIs directly (e.g., `cudaMalloc`) must also check for the CUDA toolchain:
```cmake
include(CheckLanguage)
check_language(CUDA)
find_package(agnocast_cuda QUIET)
if(CMAKE_CUDA_COMPILER AND agnocast_cuda_FOUND)
  enable_language(CUDA)
  find_package(CUDAToolkit REQUIRED)
  # ... build CUDA targets here
endif()
```

Publisher
```cpp
// Creation — identical to non-CUDA Agnocast publishers
auto pub = node->create_publisher<agnocast::cuda::PointCloud2>("/topic", qos);

// Usage
auto msg = pub->borrow_loaned_message();
msg->width = n; msg->point_step = 16;        // CPU metadata → shared memory
cudaMalloc(&msg->data, n * 16);              // GPU allocation (not intercepted by heaphook)
my_kernel<<<grid, block>>>(msg->data, ...);  // fill on GPU
pub->publish(std::move(msg));                // msg invalidated after this
```

Internally, `publish()` exports a CUDA IPC handle, stores `GpuMetadata` in shared memory (heaphook active), then issues the standard publish ioctl. On reclaim, `backend.free_device_memory()` is called (which maps to `cudaFree` on `CudaIpcBackend`).
Subscriber
```cpp
// Creation — identical to non-CUDA Agnocast subscribers
auto sub = node->create_subscription<agnocast::cuda::PointCloud2>("/topic", qos, callback);

// Callback
void callback(agnocast::ipc_shared_ptr<const agnocast::cuda::PointCloud2> msg) {
  uint32_t width = msg->width;                            // CPU metadata: msg->field
  auto* gpu_ptr = static_cast<uint8_t*>(msg.gpu_data());  // GPU data: msg.gpu_data()
  my_kernel<<<grid, block>>>(gpu_ptr, ...);
}
```

`msg->data` is the publisher's device pointer — invalid in the subscriber. Always use `msg.gpu_data()`.
Holding a copy of msg (or any ipc_shared_ptr copy) keeps the GPU mapping alive.
ipc_shared_ptr Extensions
| Method | Description |
|---|---|
| `void* gpu_data() const` | Returns subscriber-local GPU device pointer. `nullptr` for non-CUDA messages. |
Bridge
CUDA types are automatically excluded from bridging. If you need to send to a non-Agnocast node, cudaMemcpy to a standard ROS message yourself.
Supported Platforms
| Platform | Backend | Status |
|---|---|---|
| Discrete GPU (GeForce, Quadro, Tesla, A/H series) | CudaIpcBackend | Supported |
| Jetson Orin (newer CUDA), discrete GPUs | VmmBackend | Placeholder |
| Jetson Xavier / Orin | NvSciBufBackend / UnifiedMemoryBackend | Placeholder |
| Jetson Thor (CUDA 13.0+) | CudaIpcBackend | Placeholder |
| NVIDIA DRIVE | NvSciBufBackend | Placeholder |
| MIG partitions | — | Not supported (NVIDIA limitation) |
Constraints
- `cudaMalloc` only — `cudaMallocManaged` and `cudaMallocHost` do not support CUDA IPC handles
- Same GPU device required for publisher and subscriber
- `cudaIpcOpenMemHandle` latency: ~50-200us per message (cacheable in future)
- Subscriber shared memory is read-only — do not write to any `msg->field`
PR Overview: Add CUDA IPC support to Agnocast
Cross-process zero-copy GPU buffer sharing, with zero build-time CUDA dependency.
What this PR does
Adds a new agnocast_cuda package that enables publishers and subscribers to share GPU device memory across processes via CUDA IPC — with the same ipc_shared_ptr API used for CPU messages. The CUDA runtime is loaded via dlopen at runtime, so the entire package builds as pure C++17 and can be distributed as a pre-built deb from the ROS build farm without a CUDA toolchain.
Architecture
1. agnocast_cuda package (new)
- `cudart_loader.hpp` — singleton that `dlopen`s `libcudart.so` on first use; replicates stable-ABI CUDA types so no CUDA headers are needed at build time. Fallback chain: `.so` -> `.so.12` -> `.so.11.0`. Safe on non-CUDA systems (never triggered unless a CUDA message type is used).
- `cuda_ipc_backend.cpp` — production backend for discrete GPUs: `cudaIpcGetMemHandle` (export), `cudaIpcOpenMemHandle` (import), `cudaIpcCloseMemHandle` (release), `cudaFree` (reclaim). All calls go through `CudartLoader::instance()`.
- `get_backend.cpp` — queries `cudaDevAttrIntegrated` to select backend. Discrete GPU -> `CudaIpcBackend`. Integrated GPU (Jetson/DRIVE) -> placeholder backends (`VmmBackend`, `NvSciBufBackend`, `UnifiedMemoryBackend`).
- `message_types.hpp` — `agnocast::cuda::PointCloud2` and `agnocast::cuda::Image` message types (inherit from ROS msg + `cuda_message_tag`, shadow `data` field with a GPU device pointer).
2. agnocastlib integration
- `cuda_message_tag.hpp` — `is_cuda_message_v<T>` compile-time trait, `get_cuda_gpu_data_size<T>()` template.
- `gpu_transfer_backend.hpp` — abstract `GpuTransferBackend` interface + `get_backend()` declaration (defined in agnocast_cuda, resolved at link time).
- `gpu_metadata.hpp` — `GpuHandle` (64-byte opaque) + `GpuMetadata` struct stored in shared memory alongside messages.
- `agnocast_publisher.hpp` — publish path: exports GPU handle via backend, stores `GpuMetadata` in shared memory. Reclaim path: `cudaFree` via backend.
- `agnocast_callback_info.hpp` — subscriber path: imports GPU handle via backend, stores local GPU pointer in `ipc_shared_ptr` control block.
- `agnocast_smart_pointer.hpp` — `gpu_data()` accessor, `gpu_release_fn` cleanup on last reference (GPU unmap before bitmap release).
- `agnocast_bridge_node.hpp` — CUDA message types are excluded from bridging with a compile-time guard and warning.
3. Sample application and packaging
- `cuda_publisher.cpp` — allocates GPU buffer via `cudaMalloc`, launches CUDA kernel, publishes via Agnocast.
- `cuda_subscriber.cpp` — receives message, accesses GPU data via `msg.gpu_data()`, verifies with `cudaMemcpy` D->H.
- CMakeLists uses the `check_language(CUDA)` + `CMAKE_CUDA_COMPILER AND agnocast_cuda_FOUND` pattern.
- Meta-package `agnocast` now includes `agnocast_cuda` as exec_depend (safe on non-CUDA systems).
Message flow
```
Publisher                               Kernel Module              Subscriber
    |                                         |                        |
    | cudaMalloc -> msg->data                 |                        |
    | fill_kernel<<<>>>                       |                        |
    | publish()                               |                        |
    |   export_handle (cudaIpcGet...)         |                        |
    |   store GpuMetadata in shmem            |                        |
    |   publish ioctl ----------------------->| entry + bitmap         |
    |   mq_send ----------------------------->|---- epoll wakeup ----->|
    |                                         |<--- receive ioctl -----|
    |                                         |       import_handle    |
    |                                         |       (cudaIpcOpen...) |
    |                                         |       gpu_data() -->   |
    |                                         |       kernel<<<>>>     |
    |                                         |                        |
    |                                         |<- release (last ref) --|
    |                                         |  cudaIpcCloseMemHandle |
    | reclaim (bitmap=0) <------------------->|                        |
    | cudaFree(publisher_gpu_ptr)             |                        |
```
Key design decisions
- No build-time CUDA dependency: `libagnocast_cuda.so` has zero DT_NEEDED on libcudart. Verified: no CUDA symbols, no RPATH to CUDA dirs.
- Lazy loading: `CudartLoader` is only triggered when a CUDA message type is actually published/subscribed. Non-CUDA users are completely unaffected.
- ABI-stable type replicas: `cudaIpcMemHandle_t` (64 bytes), `cudaDevAttrIntegrated` (= 18), etc. — verified against CUDA 12.8 headers in `driver_types.h`.
- Same lifetime model: GPU buffer cleanup piggybacks on the existing kernel-bitmap + `ipc_shared_ptr` reference counting. No new lifetime API.
Verification
- Full workspace build (11 packages) with zero CUDA on PATH: pass
- `libagnocast_cuda.so` ldd/nm/readelf: no CUDA deps, `dlopen`/`dlsym` present: pass
- Downstream `find_package(agnocast_cuda)` + target resolution: pass
- Sample app skips CUDA binaries when no CUDA compiler: pass
- End-to-end `cuda_talker` + `cuda_listener` at 10Hz with data verification: pass
- clang-format: pass
Coverage Report (humble)
Coverage Report (jazzy)

stop merge.

We want agnocast_cuda (including `libagnocast_cuda.so`) to be a pre-built deb distributed from the ROS build farm — but the build farm doesn't have CUDA. That means the `.cpp` files that call CUDA runtime APIs (`cudaIpcGetMemHandle`, `cudaMalloc`, etc.) can't be compiled there at all. The solution: `dlopen` `libcudart.so` at runtime instead of linking against it at build time. Then agnocast_cuda has zero build-time CUDA dependency — it's pure C++ that loads the CUDA runtime dynamically.
Ready for review.
```cpp
if (!raw_ptr->data) {
  RCLCPP_ERROR(
    logger,
    "CUDA message on topic '%s' has null data pointer. "
    "Did you forget to cudaMalloc(&msg->data, size) before publish()?",
    topic_name_.c_str());
  std::abort();
}
```
Pointers allocated by `cudaMallocManaged()` can't be used for CUDA IPC.

`__host__ cudaError_t cudaPointerGetAttributes(cudaPointerAttributes* attributes, const void* ptr)` can be used to judge whether the given pointer `ptr` was allocated by `cudaMalloc()` (or `cudaMallocPitch()`).

Unfortunately, pointers allocated by `cudaMallocAsync()` can't be distinguished this way, while those pointers cannot be shared among multiple processes without sharing the whole memory pool behind them (not recommended, for safety and other reasons).
For this problem, there are two options.

- Use the CUDA driver API `cuPointerGetAttribute()` with `CU_POINTER_ATTRIBUTE_MEMPOOL_HANDLE` to detect the error earlier and report it.

```cpp
// Mandatory and low cost.
// Calling once per process after CUDA initialization is enough.
// Can be called multiple times with low cost.
cuInit(0);
CUcontext ctx;
cuCtxGetCurrent(&ctx);
if (ctx) {
  CUmemoryPool pool = NULL;
  CUresult res = cuPointerGetAttribute(&pool, CU_POINTER_ATTRIBUTE_MEMPOOL_HANDLE, (CUdeviceptr)ptr);
  if (res == CUDA_SUCCESS && pool != NULL) {
    // ptr was allocated by cudaMallocAsync() from the pool.
  }
}
```
- Do not worry about it here; `cudaIpcGetMemHandle()` will fail later with inappropriate pointers anyway.
  - The reported failure reason will be less clear, though.
By the way, I think the documentation about usable pointer types needs clarification.

- Pointers allocated by `cudaMalloc()` can be used.
- Pointers allocated by `cudaMallocPitch()` should also be allowed, I think.
  - Good for handling 2D "Image" data.
- Pointers allocated by `cudaMallocAsync()` cannot be used.
- Pointers allocated by `cudaMallocHost()` cannot be used.
- Other `cudaMallocXXX()` functions return pointers of dedicated types, and those pointers cannot be used.
  - C++ compilers allow implicit conversion from `MyType*` to `void*` without warning.
Description
Documentation for Users
Related links
How was this PR tested?
Correctly working at https://github.com/tier4/pilot-auto.x2/blob/apply_agnocast_for_v4311/docs/agnocast_migration_status.md
- `bash scripts/test/e2e_test_1to1.bash` (required)
- `bash scripts/test/e2e_test_2to2.bash` (required)

Notes for reviewers
Version Update Label (Required)
Please add exactly one of the following labels to this PR:
- `need-major-update`: User API breaking changes
- `need-minor-update`: Internal API breaking changes (heaphook/kmod/agnocastlib compatibility)
- `need-patch-update`: Bug fixes and other changes

Important notes:

- If the PR is `need-major-update` or `need-minor-update`, please include this in the PR title as well, e.g. `fix(foo)[needs major version update]: bar` or `feat(baz)[needs minor version update]: qux`.
- Add the `run-build-test` label. The PR can only be merged after the build tests pass.

See CONTRIBUTING.md for detailed versioning rules.