feat(ptv3)!: precompute serialized pooling metadata#12727

Merged
amadeuszsz merged 3 commits into
autowarefoundation:mainfrom
mojomex:codex/ptv3-serialized-pooling-integration
Jun 12, 2026

Conversation

@mojomex mojomex commented Jun 8, 2026

Contributor

Summary

Warning

This PR breaks backward-compatibility with current ONNX files.

Important

Hard dependency: this PR cannot run on its own. It requires the companion export PR tier4/AWML#206, which produces an ONNX file whose SerializedPooling subgraphs match the new dynamic-input contract described below. Merge and test the two together.

Note

Sorry, this is quite a big PR. I haven't found a good way to split, so maybe going commit-by-commit is best for reviewing. If I should split into library code+unit tests in one PR, followed by integration in PTv3 node, please let me know.

PTv3, when implemented as a 1:1 mapping to the Pointcept repo, suffers from data-dependent shapes (N voxels are pooled into M voxels at each SerializedPooling stage). This causes TensorRT to insert trainStations, forcing CPU/GPU sync and memory allocation, increasing latency.

In Nsys, this can be seen as distinct holes in the otherwise full blue GPU utilization bar:

[Image: Nsight Systems timeline showing gaps in the GPU utilization bar]

This PR precomputes PTv3 serialized-pooling metadata during preprocessing and feeds it into the TensorRT engine as dynamic inputs, allowing TensorRT to infer all shapes before inference starts, and eliminating all trainStations:

[Image: Nsight Systems timeline with this PR applied, without trainStations]

More precisely, each PTv3 encoder pooling stage groups the current voxel set into parent voxels. For the current exported model this is always stride-2 pooling, so the parent key can be computed from the serialized code by dropping the three least-significant interleaved coordinate bits. Those bits represent the lowest x/y/z voxel bit (...xyzxyzxyz), so dropping them is equivalent to integer-dividing the voxel coordinate by (2, 2, 2).
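The bit-dropping argument above can be checked with a few lines of Python. This is a minimal sketch, assuming a plain z-order interleaving with `z` in the least-significant position of each `xyz` triple; the helper names `interleave3` and `parent_key` are hypothetical and not taken from the PR, and the exported model's actual serialization orders may lay out bits differently.

```python
def interleave3(x: int, y: int, z: int, bits: int = 16) -> int:
    """Z-order code with bit layout ...xyzxyzxyz (assumed: z is the lowest bit)."""
    code = 0
    for b in range(bits):
        code |= ((x >> b) & 1) << (3 * b + 2)
        code |= ((y >> b) & 1) << (3 * b + 1)
        code |= ((z >> b) & 1) << (3 * b)
    return code


def parent_key(code: int) -> int:
    """Stride-2 parent key: drop the lowest x/y/z bit triple."""
    return code >> 3


# Dropping three bits matches integer-dividing each coordinate by 2.
for x, y, z in [(0, 0, 0), (5, 2, 7), (12, 13, 3), (255, 1, 128)]:
    assert parent_key(interleave3(x, y, z)) == interleave3(x // 2, y // 2, z // 2)
```

The loop asserts the equivalence for a handful of coordinates; it holds for any coordinates that fit in `bits` bits, because the shift moves every higher bit triple down by exactly one level.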

The expensive data-dependent part is discovering the unique parent voxels and the input-to-parent correspondence. Instead of doing that inside the TensorRT engine, preprocessing now sorts these parent keys, detects run starts, and writes the CSR/gather metadata for every pooling stage before enqueue. The engine then receives the already-known N_i and M_i stage sizes as input shapes, and the ONNX pooling subgraph only needs native Gather plus the existing autoware::SegmentCSR plugin.
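The sort, run-start detection, and CSR write described above can be sketched as a CPU reference in plain Python. This is an illustration under assumptions, not the CUDA kernel from this PR: the function name `build_pooling_metadata` is hypothetical, pooled voxels are assumed to be numbered in ascending parent-key order, and the first voxel of each sorted run is assumed to be the representative ("head") voxel.

```python
def build_pooling_metadata(codes):
    """Build CSR/gather metadata for one stride-2 pooling stage.

    codes: one serialized code per input voxel (length N).
    Returns (indices, indptr, cluster, head_indices).
    """
    parent_keys = [c >> 3 for c in codes]
    # Stable sort of input voxel ids by parent key: this is the Gather order.
    indices = sorted(range(len(codes)), key=lambda i: parent_keys[i])

    indptr = []                 # CSR row pointer, length M + 1 when done
    head_indices = []           # one representative input voxel per parent
    cluster = [0] * len(codes)  # input voxel id -> pooled voxel id
    for pos, voxel in enumerate(indices):
        is_run_start = pos == 0 or parent_keys[voxel] != parent_keys[indices[pos - 1]]
        if is_run_start:
            indptr.append(pos)
            head_indices.append(voxel)
        cluster[voxel] = len(indptr) - 1
    indptr.append(len(codes))
    return indices, indptr, cluster, head_indices


# Six input voxels whose parent keys are 1, 0, 1, 0, 1, 8.
indices, indptr, cluster, head_indices = build_pooling_metadata([9, 0, 15, 1, 8, 64])
print(indices, indptr, cluster, head_indices)
```

Here N is 6 and M is 3: `indptr` has M + 1 entries, `indices` and `cluster` have N entries, and `head_indices` has M entries, matching the shapes the engine expects for one stage.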

ONNX / Engine Interface

The exported ONNX must replace each old SerializedPooling subgraph with:

gathered_i = Gather(feature_i, serialized_pooling_i_indices, axis=0)
feature_{i+1} = autoware::SegmentCSR(gathered_i, serialized_pooling_i_indptr, reduce="max")
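What this two-op subgraph computes can be mimicked on the CPU. The following is a reference sketch only, assuming features are stored as a list of per-voxel channel lists and that every CSR segment is non-empty; `gather` and `segment_csr_max` are hypothetical helper names, not the plugin's API.

```python
def gather(features, indices):
    """Reorder rows along axis 0, like ONNX Gather(axis=0)."""
    return [features[i] for i in indices]


def segment_csr_max(rows, indptr):
    """Channel-wise max over each CSR segment rows[indptr[k]:indptr[k + 1]]."""
    pooled = []
    for start, end in zip(indptr[:-1], indptr[1:]):
        segment = rows[start:end]
        pooled.append([max(channel) for channel in zip(*segment)])
    return pooled


features = [[1.0, 5.0], [2.0, 0.0], [3.0, -1.0], [0.5, 9.0]]  # N = 4 voxels, 2 channels
indices = [1, 3, 0, 2]   # puts voxels of the same parent next to each other
indptr = [0, 2, 4]       # M = 2 pooled voxels

pooled = segment_csr_max(gather(features, indices), indptr)
assert pooled == [[2.0, 9.0], [3.0, 5.0]]
```

The output has `len(indptr) - 1` rows, which is why the `indptr` input alone fixes the output feature shape of each stage.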

For each pooling stage i, where N_i is the input voxel count of that stage, M_i is the pooled output voxel count, and O is the number of serialization orders, the engine now expects these additional dynamic inputs:

| Input | Type | Shape | Consumer / meaning |
| --- | --- | --- | --- |
| serialized_pooling_i_indices | int64 | [N_i] | Native ONNX Gather indices. Reorders input features into CSR segment order. |
| serialized_pooling_i_indptr | int64 | [M_i + 1] | autoware::SegmentCSR row pointer. Defines segment boundaries and output feature shape. |
| serialized_pooling_i_cluster | int64 | [N_i] | Original input voxel index to pooled voxel id mapping. |
| serialized_pooling_i_head_indices | int64 | [M_i] | Representative input voxel index for each pooled voxel. |
| serialized_pooling_i_grid_coord | int64 | [M_i, 3] | Pooled voxel coordinates for downstream PTv3 blocks. |
| serialized_pooling_i_serialized_order | int64 | [O, M_i] | Serialization order over pooled voxels. |
| serialized_pooling_i_serialized_inverse | int64 | [O, M_i] | Inverse serialization order over pooled voxels. |

The original model inputs remain:

| Input | Type | Shape |
| --- | --- | --- |
| grid_coord | int64 | [N_0, 3] |
| feat | float32 | [N_0, 4] |
| serialized_code | int64 | [O, N_0] |

N_0 is the initial voxel count produced by preprocessing. For stage i, N_i is N_0 for the first stage and M_{i-1} afterwards. The preprocessing code copies only the small vector of stage counts (N_0, M_0, M_1, ...) back to host so TensorRT input shapes can be set before inference; all metadata tensors themselves stay on device.
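To make the shape bookkeeping concrete, the sketch below derives every engine input shape from the host-side vector of stage counts. It is an assumption-laden illustration: the function name `engine_input_shapes` is hypothetical, the stage index in the input names is assumed to be zero-based, and the voxel counts in the example are made up.

```python
def engine_input_shapes(stage_counts, num_orders, feat_channels=4):
    """Map each engine input name to its shape.

    stage_counts: [N_0, M_0, M_1, ...] as copied back to host.
    num_orders:   O, the number of serialization orders.
    """
    n0 = stage_counts[0]
    shapes = {
        "grid_coord": (n0, 3),
        "feat": (n0, feat_channels),
        "serialized_code": (num_orders, n0),
    }
    for i in range(len(stage_counts) - 1):
        # N_i is N_0 for the first stage and M_{i-1} afterwards.
        n_i, m_i = stage_counts[i], stage_counts[i + 1]
        prefix = f"serialized_pooling_{i}_"
        shapes[prefix + "indices"] = (n_i,)
        shapes[prefix + "indptr"] = (m_i + 1,)
        shapes[prefix + "cluster"] = (n_i,)
        shapes[prefix + "head_indices"] = (m_i,)
        shapes[prefix + "grid_coord"] = (m_i, 3)
        shapes[prefix + "serialized_order"] = (num_orders, m_i)
        shapes[prefix + "serialized_inverse"] = (num_orders, m_i)
    return shapes


# Made-up counts for four pooling stages and O = 2.
shapes = engine_input_shapes([1000, 400, 150, 60, 20], num_orders=2)
assert len(shapes) == 3 + 7 * 4
assert shapes["serialized_pooling_1_indptr"] == (151,)
```

Only the short `stage_counts` vector has to cross from device to host; everything else in the dictionary follows from it and the fixed value of O.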

Changes

  • Add ml-package parameters for PTv3 serialization orders and pooling strides.
  • Allocate serialized-pooling metadata buffers once from the configured max voxel count.
  • Generate per-stage pooling metadata on CUDA during preprocessing.
  • Bind metadata buffers and set their TensorRT input shapes before inference.
  • Add a CPU-reference gtest for the ONNX-facing serialized-pooling metadata.

Benchmark Notes

ADM-AL30 with NVIDIA RTX 4000 SFF Ada, PTv3 variants:

| Variant | Total ms | Preprocess ms | Inference ms |
| --- | --- | --- | --- |
| ptv3-t18 baseline | 30.439 +/- 0.950 | 1.333 | 29.093 |
| fused serialized-pooling plugin (#12717) | 22.140 +/- 0.180 | 3.473 | 18.652 |
| split native Gather + existing SegmentCSR | 22.739 +/- 0.086 | 3.585 | 19.138 |

The split graph has no Unique and no fused pooling plugin. A text scan of the exported Nsight SQLite file found 0 matches for "train" in the split profile.

Validation

colcon build --packages-up-to autoware_ptv3 --event-handlers console_direct+
colcon test --packages-select autoware_ptv3 --event-handlers console_direct+

Also ran a before/after comparison on real data using the ML packages linked below. Results are identical (as expected).

output.webm

Reviewing this PR

To test inference, see these TIER IV INTERNAL ML packages.

Suggested path: review commit-by-commit. The logical decomposition already exists at commit granularity:

  1. feat: precompute serialized pooling metadata — the CUDA kernel.
  2. fix: bind serialized pooling metadata inputs — TRT alloc, binding, and shape setting.
  3. test: cover serialized pooling metadata — the CPU-reference test.
  4. style: apply pre-commit formatting — noise, skippable.

The key thing to verify is that the metadata producer (commits 1–2) matches the 7-input contract in the ONNX / Engine Interface table above, which is the same contract the exporter targets in tier4/AWML#206.

@github-actions github-actions Bot added the component:perception Advanced sensor data processing and environment understanding. (auto-assigned) label Jun 8, 2026

github-actions Bot commented Jun 8, 2026

Thank you for contributing to the Autoware project!

🚧 If your pull request is in progress, switch it to draft mode.

Please ensure:

@mojomex mojomex self-assigned this Jun 8, 2026
@mojomex mojomex added the run:build-and-test-differential Mark to enable build-and-test-differential workflow. (used-by-ci) label Jun 8, 2026

codecov Bot commented Jun 8, 2026

Codecov Report

❌ Patch coverage is 0% with 178 lines in your changes missing coverage. Please review.
✅ Project coverage is 0.40%. Comparing base (80d5c9d) to head (8100ee3).
⚠️ Report is 2 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| perception/autoware_ptv3/lib/ptv3_trt.cpp | 0.00% | 99 Missing ⚠️ |
| .../autoware_ptv3/lib/preprocess/preprocess_kernel.cu | 0.00% | 61 Missing ⚠️ |
| ...utoware_ptv3/include/autoware/ptv3/ptv3_config.hpp | 0.00% | 15 Missing ⚠️ |
| perception/autoware_ptv3/src/ptv3_node.cpp | 0.00% | 3 Missing ⚠️ |

Additional details and impacted files
@@             Coverage Diff             @@
##             main   #12727       +/-   ##
===========================================
- Coverage   19.58%    0.40%   -19.19%     
===========================================
  Files        1903       12     -1891     
  Lines      131382      993   -130389     
  Branches    45956      165    -45791     
===========================================
- Hits        25733        4    -25729     
+ Misses      84684      988    -83696     
+ Partials    20965        1    -20964     
Flag Coverage Δ
full-suite 0.40% <0.00%> (-19.19%) ⬇️

mojomex added a commit to tier4/AWML that referenced this pull request Jun 8, 2026
Adapt the PTv3 ONNX export so each encoder SerializedPooling stage consumes
precomputed pooling metadata as graph inputs instead of discovering it in-graph
via Unique. This matches the updated inference node in
autowarefoundation/autoware_universe#12727: the exported pooling subgraph is now
native Gather + autoware::SegmentCSR with no data-dependent Unique.

- tools/export.py: build per-stage metadata sample tensors and register them as
  named dynamic ONNX inputs (serialized_pooling_{i}_{indices,indptr,cluster,
  head_indices,grid_coord,serialized_order,serialized_inverse}); attach each
  stage's metadata to its SerializedPooling module before tracing.
- SerializedPooling: in export_mode read metadata from the module attribute
  instead of computing it. Kept off the Point because addict recursively
  converts a metadata dict into a (coord-less) Point and recurses infinitely.
- Add a CPU export/train equivalence test and document the ONNX preprocessing
  contract in the README.

Verified by exporting a j6gen2 lidarseg checkpoint (O=2) in the awml-ptv3
docker: 4 pooling stages x 7 metadata inputs, 4x SegmentCSR, native Gather, no
Unique; onnx.checker passes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
mojomex added a commit to tier4/AWML that referenced this pull request Jun 8, 2026
Adapt the PTv3 ONNX export so each encoder SerializedPooling stage consumes
precomputed pooling metadata as graph inputs instead of discovering it in-graph
via Unique. This matches the updated inference node in
autowarefoundation/autoware_universe#12727: the exported pooling subgraph is now
native Gather + autoware::SegmentCSR with no data-dependent Unique.

- tools/export.py: build per-stage metadata sample tensors and register them as
  named dynamic ONNX inputs (serialized_pooling_{i}_{indices,indptr,cluster,
  head_indices,grid_coord,serialized_order,serialized_inverse}).
- SerializedPooling: in export_mode read its metadata from the Point. The
  metadata travels as a SerializedPoolingMeta dataclass (not a dict) so addict
  stores it verbatim and propagates it across stages -- a plain dict would be
  recursively converted into a coord-less Point and recurse infinitely.
- Add a CPU export/train equivalence test and document the ONNX preprocessing
  contract in the README.

Verified by exporting a j6gen2 lidarseg checkpoint (O=2) in the awml-ptv3
docker: 31 graph inputs (4 pooling stages x 7 metadata inputs), 4x SegmentCSR,
native Gather, no Unique; onnx.checker passes; unit test passes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@mojomex mojomex requested a review from Copilot June 9, 2026 07:35

Copilot AI left a comment

Contributor

Pull request overview

This PR updates the autoware_ptv3 pipeline to precompute PTv3 serialized-pooling metadata during CUDA preprocessing and feed it into the TensorRT engine as additional dynamic inputs, enabling TensorRT to infer shapes up-front and avoid data-dependent shape ops inside the engine (reducing sync/allocation “trainStation” behavior).

Changes:

  • Add a CUDA preprocessing path to generate per-stage serialized-pooling metadata tensors (CSR/gather + downstream serialization inputs).
  • Update the TensorRT wrapper to allocate/bind serialized-pooling metadata buffers and set their dynamic input shapes per frame.
  • Plumb new configuration parameters (serialization_orders, pooling_strides) through schema/YAML/node and add a CPU-reference CUDA gtest validating the ONNX-facing metadata contract.

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| perception/autoware_ptv3/lib/preprocess/preprocess_kernel.cu | Adds CUDA-side generation of per-stage serialized-pooling metadata using sort/scan to build CSR/gather structures and serialization orders. |
| perception/autoware_ptv3/include/autoware/ptv3/preprocess/preprocess_kernel.hpp | Exposes the serialized-pooling metadata generation API and stage buffer view struct. |
| perception/autoware_ptv3/lib/ptv3_trt.cpp | Allocates metadata buffers, binds them as TRT inputs, computes metadata each frame, and sets dynamic shapes before enqueue. |
| perception/autoware_ptv3/include/autoware/ptv3/ptv3_trt.hpp | Adds members and helper methods for serialized-pooling metadata buffer management and shape binding. |
| perception/autoware_ptv3/include/autoware/ptv3/ptv3_config.hpp | Adds config fields + validation for serialization_orders and pooling_strides. |
| perception/autoware_ptv3/src/ptv3_node.cpp | Declares new ROS parameters and forwards them into PTv3Config. |
| perception/autoware_ptv3/schema/ml_package_ptv3.schema.json | Extends ml-package schema with the new parameters. |
| perception/autoware_ptv3/config/ml_package_ptv3.param.yaml | Adds default values for the new parameters. |
| perception/autoware_ptv3/test/serialized_pooling_metadata_test.cpp | Adds a CUDA-backed gtest comparing GPU-generated metadata to a CPU reference implementation. |
| perception/autoware_ptv3/CMakeLists.txt | Registers and links the new gtest when testing is enabled (and CUDA/TRT are available). |
| perception/autoware_ptv3/package.xml | Adds the missing ament_cmake_gtest test dependency. |

Comment thread perception/autoware_ptv3/lib/ptv3_trt.cpp
Comment thread perception/autoware_ptv3/schema/ml_package_ptv3.schema.json
Comment thread perception/autoware_ptv3/schema/ml_package_ptv3.schema.json
@mojomex mojomex force-pushed the codex/ptv3-serialized-pooling-integration branch from 9e4ecbc to 7404ba4 Compare June 9, 2026 10:18
@mojomex mojomex marked this pull request as ready for review June 9, 2026 12:20
@mojomex mojomex requested review from KSeangTan and vividf June 9, 2026 12:21
@mojomex mojomex changed the title feat(ptv3): precompute serialized pooling metadata feat(ptv3)!: precompute serialized pooling metadata Jun 9, 2026
@mojomex mojomex requested a review from ktro2828 June 9, 2026 12:24
@amadeuszsz

Contributor

@codex review

@chatgpt-codex-connector

Codex Review: Didn't find any major issues. Chef's kiss.

Reviewed commit: 7404ba4c56

Comment thread perception/autoware_ptv3/include/autoware/ptv3/preprocess/preprocess_kernel.hpp Outdated
Signed-off-by: Max SCHMELLER <max.schmeller@tier4.jp>
@mojomex mojomex force-pushed the codex/ptv3-serialized-pooling-integration branch from 7404ba4 to 0db4178 Compare June 12, 2026 09:56
Signed-off-by: Max SCHMELLER <max.schmeller@tier4.jp>
@mojomex mojomex force-pushed the codex/ptv3-serialized-pooling-integration branch from 4fa0f63 to 33f7272 Compare June 12, 2026 10:23
@mojomex mojomex requested a review from amadeuszsz June 12, 2026 10:25

mojomex commented Jun 12, 2026

Contributor Author

No trainStations at 33f7272:

[Image: Nsight Systems timeline at 33f7272]

Note that this was captured on a different GPU than the one in the PR description, and doesn't include #12555 yet, so inference doesn't look as perfect as above.

@amadeuszsz amadeuszsz left a comment

Contributor

LGTM!

@amadeuszsz amadeuszsz enabled auto-merge (squash) June 12, 2026 10:31
@amadeuszsz amadeuszsz merged commit 6439690 into autowarefoundation:main Jun 12, 2026
31 of 32 checks passed
@github-project-automation github-project-automation Bot moved this from To Triage to Done in Software Working Group Jun 12, 2026
tier4-autoware-public-bot Bot pushed a commit to tier4/autoware_universe_perception that referenced this pull request Jun 12, 2026
…on/autoware_universe#12727)

* perf: serialized pooling optimization

Signed-off-by: Max SCHMELLER <max.schmeller@tier4.jp>

* chore: fix rebase

Signed-off-by: Max SCHMELLER <max.schmeller@tier4.jp>

* style(pre-commit): autofix

---------

Signed-off-by: Max SCHMELLER <max.schmeller@tier4.jp>
Co-authored-by: pre-commit-ci-lite[bot] <117423508+pre-commit-ci-lite[bot]@users.noreply.github.com>

Labels

component:perception Advanced sensor data processing and environment understanding. (auto-assigned) run:build-and-test-differential Mark to enable build-and-test-differential workflow. (used-by-ci) tag:require-cuda-build-and-test

Projects

Status: Done

3 participants