feat(ptv3)!: precompute serialized pooling metadata#12727
Codecov Report ❌

Coverage diff, main vs. #12727:

| Metric | main | #12727 | +/- |
|---|---|---|---|
| Coverage | 19.58% | 0.40% | -19.19% |
| Files | 1903 | 12 | -1891 |
| Lines | 131382 | 993 | -130389 |
| Branches | 45956 | 165 | -45791 |
| Hits | 25733 | 4 | -25729 |
| Misses | 84684 | 988 | -83696 |
| Partials | 20965 | 1 | -20964 |
|
Adapt the PTv3 ONNX export so each encoder SerializedPooling stage consumes precomputed pooling metadata as graph inputs instead of discovering it in-graph via Unique. This matches the updated inference node in autowarefoundation/autoware_universe#12727: the exported pooling subgraph is now native Gather + autoware::SegmentCSR with no data-dependent Unique.

- tools/export.py: build per-stage metadata sample tensors and register them as named dynamic ONNX inputs (serialized_pooling_{i}_{indices,indptr,cluster,head_indices,grid_coord,serialized_order,serialized_inverse}); attach each stage's metadata to its SerializedPooling module before tracing.
- SerializedPooling: in export_mode read metadata from the module attribute instead of computing it. Kept off the Point because addict recursively converts a metadata dict into a (coord-less) Point and recurses infinitely.
- Add a CPU export/train equivalence test and document the ONNX preprocessing contract in the README.

Verified by exporting a j6gen2 lidarseg checkpoint (O=2) in the awml-ptv3 docker: 4 pooling stages x 7 metadata inputs, 4x SegmentCSR, native Gather, no Unique; onnx.checker passes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Adapt the PTv3 ONNX export so each encoder SerializedPooling stage consumes precomputed pooling metadata as graph inputs instead of discovering it in-graph via Unique. This matches the updated inference node in autowarefoundation/autoware_universe#12727: the exported pooling subgraph is now native Gather + autoware::SegmentCSR with no data-dependent Unique.

- tools/export.py: build per-stage metadata sample tensors and register them as named dynamic ONNX inputs (serialized_pooling_{i}_{indices,indptr,cluster,head_indices,grid_coord,serialized_order,serialized_inverse}).
- SerializedPooling: in export_mode read its metadata from the Point. The metadata travels as a SerializedPoolingMeta dataclass (not a dict) so addict stores it verbatim and propagates it across stages; a plain dict would be recursively converted into a coord-less Point and recurse infinitely.
- Add a CPU export/train equivalence test and document the ONNX preprocessing contract in the README.

Verified by exporting a j6gen2 lidarseg checkpoint (O=2) in the awml-ptv3 docker: 31 graph inputs (4 pooling stages x 7 metadata inputs), 4x SegmentCSR, native Gather, no Unique; onnx.checker passes; unit test passes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Pull request overview
This PR updates the autoware_ptv3 pipeline to precompute PTv3 serialized-pooling metadata during CUDA preprocessing and feed it into the TensorRT engine as additional dynamic inputs, enabling TensorRT to infer shapes up-front and avoid data-dependent shape ops inside the engine (reducing sync/allocation “trainStation” behavior).
Changes:
- Add a CUDA preprocessing path to generate per-stage serialized-pooling metadata tensors (CSR/gather + downstream serialization inputs).
- Update the TensorRT wrapper to allocate/bind serialized-pooling metadata buffers and set their dynamic input shapes per frame.
- Plumb new configuration parameters (`serialization_orders`, `pooling_strides`) through schema/YAML/node and add a CPU-reference CUDA gtest validating the ONNX-facing metadata contract.
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| perception/autoware_ptv3/lib/preprocess/preprocess_kernel.cu | Adds CUDA-side generation of per-stage serialized-pooling metadata using sort/scan to build CSR/gather structures and serialization orders. |
| perception/autoware_ptv3/include/autoware/ptv3/preprocess/preprocess_kernel.hpp | Exposes the serialized-pooling metadata generation API and stage buffer view struct. |
| perception/autoware_ptv3/lib/ptv3_trt.cpp | Allocates metadata buffers, binds them as TRT inputs, computes metadata each frame, and sets dynamic shapes before enqueue. |
| perception/autoware_ptv3/include/autoware/ptv3/ptv3_trt.hpp | Adds members and helper methods for serialized-pooling metadata buffer management and shape binding. |
| perception/autoware_ptv3/include/autoware/ptv3/ptv3_config.hpp | Adds config fields + validation for serialization_orders and pooling_strides. |
| perception/autoware_ptv3/src/ptv3_node.cpp | Declares new ROS parameters and forwards them into PTv3Config. |
| perception/autoware_ptv3/schema/ml_package_ptv3.schema.json | Extends ml-package schema with the new parameters. |
| perception/autoware_ptv3/config/ml_package_ptv3.param.yaml | Adds default values for the new parameters. |
| perception/autoware_ptv3/test/serialized_pooling_metadata_test.cpp | Adds a CUDA-backed gtest comparing GPU-generated metadata to a CPU reference implementation. |
| perception/autoware_ptv3/CMakeLists.txt | Registers and links the new gtest when testing is enabled (and CUDA/TRT are available). |
| perception/autoware_ptv3/package.xml | Adds the missing ament_cmake_gtest test dependency. |
9e4ecbc to
7404ba4
Compare
|
@codex review |
|
Codex Review: Didn't find any major issues. Chef's kiss.
Signed-off-by: Max SCHMELLER <max.schmeller@tier4.jp>
7404ba4 to
0db4178
Compare
Signed-off-by: Max SCHMELLER <max.schmeller@tier4.jp>
4fa0f63 to
33f7272
Compare
…on/autoware_universe#12727)

* perf: serialized pooling optimization

Signed-off-by: Max SCHMELLER <max.schmeller@tier4.jp>

* chore: fix rebase

Signed-off-by: Max SCHMELLER <max.schmeller@tier4.jp>

* style(pre-commit): autofix

---------

Signed-off-by: Max SCHMELLER <max.schmeller@tier4.jp>
Co-authored-by: pre-commit-ci-lite[bot] <117423508+pre-commit-ci-lite[bot]@users.noreply.github.com>

Summary

Warning
This PR breaks backward-compatibility with current ONNX files.

Important
Hard dependency: this PR cannot run on its own. It requires the companion export PR tier4/AWML#206, which produces an ONNX whose `SerializedPooling` subgraphs match the new dynamic-input contract described below. Merge and test the two together.

Note
Sorry, this is quite a big PR. I haven't found a good way to split it, so reviewing commit-by-commit is probably best. If I should split it into library code plus unit tests in one PR, followed by the integration into the PTv3 node, please let me know.

PTv3, when implemented as a 1:1 mapping to the Pointcept repo, suffers from data-dependent shapes (`N` voxels are pooled into `M` voxels at each `SerializedPooling` stage). This causes TensorRT to insert `trainStation`s, forcing CPU/GPU sync and memory allocation and increasing latency.

In Nsys, this can be seen as distinct holes in the otherwise full blue GPU utilization bar.

This PR precomputes PTv3 serialized-pooling metadata during preprocessing and feeds it into the TensorRT engine as dynamic inputs, allowing TensorRT to infer all shapes before inference starts and eliminating all `trainStation`s.

More precisely, each PTv3 encoder pooling stage groups the current voxel set into parent voxels. For the current exported model this is always stride-2 pooling, so the parent key can be computed from the serialized code by dropping the three least-significant interleaved coordinate bits. Those bits represent the lowest x/y/z voxel bit (`...xyzxyzxyz`), so dropping them is equivalent to integer-dividing the voxel coordinate by `(2, 2, 2)`.

The expensive data-dependent part is discovering the unique parent voxels and the input-to-parent correspondence. Instead of doing that inside the TensorRT engine, preprocessing now sorts these parent keys, detects run starts, and writes the CSR/gather metadata for every pooling stage before enqueue. The engine then receives the already-known `N_i` and `M_i` stage sizes as input shapes, and the ONNX pooling subgraph only needs native `Gather` plus the existing `autoware::SegmentCSR` plugin.

ONNX / Engine Interface
The exported ONNX must replace each old `SerializedPooling` subgraph with native `Gather` plus the existing `autoware::SegmentCSR` plugin.

For each pooling stage `i`, where `N_i` is the input voxel count of that stage, `M_i` is the pooled output voxel count, and `O` is the number of serialization orders, the engine now expects these additional dynamic inputs:

| Input | Type and shape | Description |
|---|---|---|
| `serialized_pooling_i_indices` | `int64[N_i]` | `Gather` indices. Reorders input features into CSR segment order. |
| `serialized_pooling_i_indptr` | `int64[M_i + 1]` | `autoware::SegmentCSR` row pointer. Defines segment boundaries and output feature shape. |
| `serialized_pooling_i_cluster` | `int64[N_i]` | |
| `serialized_pooling_i_head_indices` | `int64[M_i]` | |
| `serialized_pooling_i_grid_coord` | `int64[M_i, 3]` | |
| `serialized_pooling_i_serialized_order` | `int64[O, M_i]` | |
| `serialized_pooling_i_serialized_inverse` | `int64[O, M_i]` | |

The original model inputs remain:

| Input | Type and shape |
|---|---|
| `grid_coord` | `int64[N_0, 3]` |
| `feat` | `float32[N_0, 4]` |
| `serialized_code` | `int64[O, N_0]` |

`N_0` is the initial voxel count produced by preprocessing. For stage `i`, `N_i` is `N_0` for the first stage and `M_{i-1}` afterwards. The preprocessing code copies only the small vector of stage counts (`N_0, M_0, M_1, ...`) back to host so TensorRT input shapes can be set before inference; all metadata tensors themselves stay on device.

Changes
Benchmark Notes

ADM-AL30 with NVIDIA RTX 4000 SFF Ada, PTv3 variants:

| Variant | | | |
|---|---|---|---|
| ptv3-t18 baseline | 30.439 +/- 0.950 | 1.333 | 29.093 |
| | 22.140 +/- 0.180 | 3.473 | 18.652 |
| `Gather` + existing `SegmentCSR` | 22.739 +/- 0.086 | 3.585 | 19.138 |

The split graph has no `Unique` and no fused pooling plugin. The exported Nsight SQLite text scan found `0` matches for `train` in the split profile.

Validation
    colcon build --packages-up-to autoware_ptv3 --event-handlers console_direct+
    colcon test --packages-select autoware_ptv3 --event-handlers console_direct+

Also ran a before/after comparison on real data using the ML packages linked below. Results are exactly identical (as expected).
output.webm
Reviewing this PR
To test inference, see these TIER IV INTERNAL ML packages.
Suggested path: review commit-by-commit. The logical decomposition already exists at commit granularity:

1. `feat: precompute serialized pooling metadata`: the CUDA kernel.
2. `fix: bind serialized pooling metadata inputs`: TRT alloc, binding, and shape setting.
3. `test: cover serialized pooling metadata`: the CPU-reference test.
4. `style: apply pre-commit formatting`: noise, skippable.

The key thing to verify is that the metadata producer (commits 1–2) matches the 7-input contract in the ONNX / Engine Interface table above, which is the same contract the exporter targets in tier4/AWML#206.
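
For reviewers checking the producer against that contract, here is a minimal pure-Python sketch of one stride-2 pooling stage for a single serialization order. It is not the CUDA implementation in this PR. The helper names (`interleave3`, `parent_key`, `pooling_stage_metadata`, `pool_features`) are hypothetical. The bit layout assumes x/y/z interleaving with z in the lowest bit and no batch bits. The meanings of `cluster` and `head_indices` are assumed to follow the upstream Pointcept `SerializedPooling` logic, and the max reduction is an assumption rather than something this PR states.

```python
def interleave3(x, y, z, bits=16):
    """Hypothetical z-order key: bit b of x/y/z lands at 3b+2 / 3b+1 / 3b."""
    key = 0
    for b in range(bits):
        key |= ((x >> b) & 1) << (3 * b + 2)
        key |= ((y >> b) & 1) << (3 * b + 1)
        key |= ((z >> b) & 1) << (3 * b)
    return key


def parent_key(code):
    """Drop the lowest x/y/z bit triple, i.e. divide the coordinate by (2, 2, 2)."""
    return code >> 3


def pooling_stage_metadata(codes):
    """Build indices, indptr, cluster and head_indices for N_i input codes."""
    if not codes:
        return [], [0], [], []
    parents = [parent_key(c) for c in codes]
    # Stable sort of input positions by parent key: these are the Gather indices.
    indices = sorted(range(len(parents)), key=parents.__getitem__)
    indptr = [0]
    cluster = [0] * len(parents)
    for pos, src in enumerate(indices):
        # A run start is a sorted position whose parent differs from its predecessor.
        if pos > 0 and parents[src] != parents[indices[pos - 1]]:
            indptr.append(pos)
        # Assumed meaning of cluster: parent (segment) index of each input voxel.
        cluster[src] = len(indptr) - 1
    indptr.append(len(parents))  # final length is M_i + 1
    # Assumed meaning of head_indices: first input voxel of every segment.
    head_indices = [indices[start] for start in indptr[:-1]]
    return indices, indptr, cluster, head_indices


def pool_features(feat, indices, indptr, reduce=max):
    """Gather rows into segment order, then reduce each CSR segment per channel."""
    gathered = [feat[i] for i in indices]
    return [
        [reduce(channel) for channel in zip(*gathered[start:end])]
        for start, end in zip(indptr[:-1], indptr[1:])
    ]
```

In this sketch the stage sizes fall out of the metadata as `N_i = len(indices)` and `M_i = len(indptr) - 1`, which corresponds to the small vector of stage counts the node copies back to host before setting the TensorRT input shapes. Feeding `head_indices` back into the codes (after the shift) would give the input codes of the next stage.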