feat(ptv3)!: precompute serialized pooling metadata by mojomex · Pull Request #12727 · autowarefoundation/autoware_universe

mojomex · 2026-06-08T09:22:12Z

Summary

Warning

This PR breaks backward compatibility with current ONNX files.

Important

Hard dependency: this PR is un-runnable on its own. It requires the companion export PR tier4/AWML#206, which produces an ONNX whose SerializedPooling subgraphs match the new dynamic-input contract described below. Merge/test the two together.

Note

Sorry, this is quite a big PR. I haven't found a good way to split it, so reviewing commit-by-commit is probably best. If you would prefer that I split it into one PR with library code and unit tests, followed by another that integrates it into the PTv3 node, please let me know.

This PR in part supersedes feat(tensorrt_plugins): add GatherSegmentCSR plugin #12717.

PTv3, when implemented as a 1:1 mapping to the Pointcept repo, suffers from data-dependent shapes (N voxels are pooled into M voxels at each SerializedPooling stage). This causes TensorRT to insert trainStations, forcing CPU/GPU sync and memory allocation, increasing latency.

In Nsys, this can be seen as distinct holes in the otherwise full blue GPU utilization bar:

This PR precomputes PTv3 serialized-pooling metadata during preprocessing and feeds it into the TensorRT engine as dynamic inputs, allowing TensorRT to infer all shapes before inference starts, and eliminating all trainStations:

More precisely, each PTv3 encoder pooling stage groups the current voxel set into parent voxels. For the current exported model this is always stride-2 pooling, so the parent key can be computed from the serialized code by dropping the three least-significant interleaved coordinate bits. Those bits represent the lowest x/y/z voxel bit (...xyzxyzxyz), so dropping them is equivalent to integer-dividing the voxel coordinate by (2, 2, 2).
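
The bit-dropping argument can be checked with a small sketch. It assumes a plain z-order (Morton) code in which bit b of x, y, z lands at interleaved positions 3b+2, 3b+1 and 3b; the helper names `morton_encode` and `parent_key` and this exact bit layout are illustrative assumptions, not taken from the PR, and the serialization orders used by the real model may lay out bits differently.

```python
def morton_encode(x: int, y: int, z: int, bits: int = 16) -> int:
    """Interleave coordinate bits as ...xyzxyzxyz (hypothetical layout)."""
    code = 0
    for b in range(bits):
        code |= ((x >> b) & 1) << (3 * b + 2)
        code |= ((y >> b) & 1) << (3 * b + 1)
        code |= ((z >> b) & 1) << (3 * b)
    return code


def parent_key(code: int) -> int:
    """Drop the lowest x/y/z bit triple: one level of stride-2 pooling."""
    return code >> 3


# Dropping three interleaved bits matches encoding the halved coordinates.
for x, y, z in [(5, 2, 7), (4, 3, 6), (0, 0, 1), (13, 8, 9)]:
    assert parent_key(morton_encode(x, y, z)) == morton_encode(x // 2, y // 2, z // 2)
```

Under this layout, sibling voxels such as (4, 2, 6) and (5, 3, 7) map to the same parent key, which is what makes sorting by the key sufficient for grouping.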

The expensive data-dependent part is discovering the unique parent voxels and the input-to-parent correspondence. Instead of doing that inside the TensorRT engine, preprocessing now sorts these parent keys, detects run starts, and writes the CSR/gather metadata for every pooling stage before enqueue. The engine then receives the already-known N_i and M_i stage sizes as input shapes, and the ONNX pooling subgraph only needs native Gather plus the existing autoware::SegmentCSR plugin.
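
A minimal CPU sketch of the sort / run-start / CSR steps described above, for a single pooling stage. The function name `pooling_metadata`, the stable tie-breaking, and the choice of the first voxel in each run as its head are assumptions made for illustration; the CUDA implementation in this PR may resolve ties differently.

```python
from typing import Dict, List


def pooling_metadata(codes: List[int], shift: int = 3) -> Dict[str, List[int]]:
    """Build gather/CSR metadata for one stride-2 pooling stage (CPU sketch)."""
    parent = [c >> shift for c in codes]
    # Stable sort of input voxel indices by parent key gives the Gather indices.
    indices = sorted(range(len(codes)), key=lambda i: parent[i])
    indptr: List[int] = []        # CSR row pointer, final length M + 1
    head_indices: List[int] = []  # first input voxel of each run (assumed head)
    cluster = [0] * len(codes)    # input voxel index -> pooled voxel id
    for pos, i in enumerate(indices):
        if pos == 0 or parent[i] != parent[indices[pos - 1]]:
            indptr.append(pos)    # a new run of equal parent keys starts here
            head_indices.append(i)
        cluster[i] = len(head_indices) - 1
    indptr.append(len(codes))
    return {
        "indices": indices,
        "indptr": indptr,
        "cluster": cluster,
        "head_indices": head_indices,
    }


meta = pooling_metadata([0b001000, 0b010011, 0b001101, 0b000010, 0b010000])
# meta["indices"] == [3, 0, 2, 1, 4], meta["indptr"] == [0, 1, 3, 5]
```

Here N = 5 input voxels collapse into M = 3 parents, so the row pointer has M + 1 = 4 entries.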

ONNX / Engine Interface

The exported ONNX must replace each old SerializedPooling subgraph with:

gathered_i = Gather(feature_i, serialized_pooling_i_indices, axis=0)
feature_{i+1} = autoware::SegmentCSR(gathered_i, serialized_pooling_i_indptr, reduce="max")
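
To make the two-op subgraph concrete, here is a pure-Python sketch of what the pair is expected to compute. It assumes `reduce="max"` means a channel-wise maximum over each CSR segment and that every segment is non-empty; `gather` and `segment_csr_max` are illustrative stand-ins, not the plugin's actual API.

```python
from typing import List


def gather(features: List[List[float]], indices: List[int]) -> List[List[float]]:
    """Stand-in for ONNX Gather along axis 0."""
    return [features[i] for i in indices]


def segment_csr_max(rows: List[List[float]], indptr: List[int]) -> List[List[float]]:
    """Channel-wise max over each segment rows[indptr[m]:indptr[m + 1]]."""
    out = []
    for m in range(len(indptr) - 1):
        segment = rows[indptr[m]:indptr[m + 1]]
        out.append([max(channel) for channel in zip(*segment)])
    return out


feature_i = [[1.0, 5.0], [2.0, 0.0], [3.0, 4.0], [9.0, 1.0], [0.0, 7.0]]
indices = [3, 0, 2, 1, 4]  # reorders rows into segment order
indptr = [0, 1, 3, 5]      # three segments of sizes 1, 2, 2
pooled = segment_csr_max(gather(feature_i, indices), indptr)
# pooled == [[9.0, 1.0], [3.0, 5.0], [2.0, 7.0]]
```

The output has `len(indptr) - 1` rows, which is why the shape of `indptr` alone is enough to fix the pooled feature shape.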

For each pooling stage i, where N_i is the input voxel count of that stage, M_i is the pooled output voxel count, and O is the number of serialization orders, the engine now expects these additional dynamic inputs:

Input	Type	Shape	Consumer / meaning
`serialized_pooling_i_indices`	`int64`	`[N_i]`	Native ONNX `Gather` indices. Reorders input features into CSR segment order.
`serialized_pooling_i_indptr`	`int64`	`[M_i + 1]`	`autoware::SegmentCSR` row pointer. Defines segment boundaries and output feature shape.
`serialized_pooling_i_cluster`	`int64`	`[N_i]`	Original input voxel index to pooled voxel id mapping.
`serialized_pooling_i_head_indices`	`int64`	`[M_i]`	Representative input voxel index for each pooled voxel.
`serialized_pooling_i_grid_coord`	`int64`	`[M_i, 3]`	Pooled voxel coordinates for downstream PTv3 blocks.
`serialized_pooling_i_serialized_order`	`int64`	`[O, M_i]`	Serialization order over pooled voxels.
`serialized_pooling_i_serialized_inverse`	`int64`	`[O, M_i]`	Inverse serialization order over pooled voxels.
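
The relationship between `serialized_order` and `serialized_inverse` can be sketched for a single serialization order as follows. This assumes the order is an argsort of the pooled voxels' serialized codes and the inverse is its inverse permutation, as in the upstream Pointcept implementation; the PR text does not spell this out, and `order_and_inverse` is a hypothetical helper.

```python
from typing import List, Tuple


def order_and_inverse(codes: List[int]) -> Tuple[List[int], List[int]]:
    """Return (order, inverse) such that inverse[order[k]] == k."""
    # order[k] is the voxel that sits at rank k when sorted by code.
    order = sorted(range(len(codes)), key=lambda i: codes[i])
    inverse = [0] * len(codes)
    for rank, voxel in enumerate(order):
        inverse[voxel] = rank  # rank of each voxel in the sorted sequence
    return order, inverse


order, inverse = order_and_inverse([40, 8, 24, 16])
# order == [1, 3, 2, 0], inverse == [3, 0, 2, 1]
```

With O serialization orders, one such pair per order would be stacked into the `[O, M_i]` tensors listed in the table.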

The original model inputs remain:

Input	Type	Shape
`grid_coord`	`int64`	`[N_0, 3]`
`feat`	`float32`	`[N_0, 4]`
`serialized_code`	`int64`	`[O, N_0]`

N_0 is the initial voxel count produced by preprocessing. For stage i, N_i is N_0 for the first stage and M_{i-1} afterwards. The preprocessing code copies only the small vector of stage counts (N_0, M_0, M_1, ...) back to host so TensorRT input shapes can be set before inference; all metadata tensors themselves stay on device.
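
As a sketch of how the host-side stage counts translate into input shapes, the snippet below derives the shape of every metadata input from the vector (N_0, M_0, M_1, ...), following the table above. The function name `pooling_input_shapes` and the example counts are made up for illustration; the actual binding code lives in ptv3_trt.cpp and is not reproduced here.

```python
from typing import Dict, List, Tuple


def pooling_input_shapes(stage_counts: List[int], num_orders: int) -> Dict[str, Tuple[int, ...]]:
    """Map each metadata input name to its shape, given (N_0, M_0, M_1, ...)."""
    shapes: Dict[str, Tuple[int, ...]] = {}
    for i in range(len(stage_counts) - 1):
        # N_i is N_0 for the first stage and M_{i-1} afterwards.
        n_i, m_i = stage_counts[i], stage_counts[i + 1]
        prefix = f"serialized_pooling_{i}_"
        shapes[prefix + "indices"] = (n_i,)
        shapes[prefix + "indptr"] = (m_i + 1,)
        shapes[prefix + "cluster"] = (n_i,)
        shapes[prefix + "head_indices"] = (m_i,)
        shapes[prefix + "grid_coord"] = (m_i, 3)
        shapes[prefix + "serialized_order"] = (num_orders, m_i)
        shapes[prefix + "serialized_inverse"] = (num_orders, m_i)
    return shapes


# Hypothetical counts for four pooling stages and O = 2.
shapes = pooling_input_shapes([1000, 400, 150, 60, 20], num_orders=2)
```

Four stages yield 4 x 7 = 28 metadata shapes, all derivable from five integers, which is why only that small vector needs to cross from device to host.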

Changes

Add ml-package parameters for PTv3 serialization orders and pooling strides.
Allocate serialized-pooling metadata buffers once from the configured max voxel count.
Generate per-stage pooling metadata on CUDA during preprocessing.
Bind metadata buffers and set their TensorRT input shapes before inference.
Add a CPU-reference gtest for the ONNX-facing serialized-pooling metadata.

Benchmark Notes

ADM-AL30 with NVIDIA RTX 4000 SFF Ada, PTv3 variants:

Variant	Total ms	Preprocess ms	Inference ms
`ptv3-t18` baseline	`30.439 +/- 0.950`	`1.333`	`29.093`
fused serialized-pooling plugin (#12717)	`22.140 +/- 0.180`	`3.473`	`18.652`
split native `Gather` + existing `SegmentCSR`	`22.739 +/- 0.086`	`3.585`	`19.138`

The split graph has no Unique and no fused pooling plugin. A text scan of the exported Nsight Systems SQLite database found 0 matches for "train" in the split-graph profile, i.e. no trainStation entries.

Validation

colcon build --packages-up-to autoware_ptv3 --event-handlers console_direct+
colcon test --packages-select autoware_ptv3 --event-handlers console_direct+

Also ran a before/after comparison on real data using the ML packages linked below. The results are identical, as expected.

output.webm

Reviewing this PR

To test inference, see these TIER IV INTERNAL ML packages.

Suggested path: review commit-by-commit. The logical decomposition already exists at commit granularity:

feat: precompute serialized pooling metadata — the CUDA kernel.
fix: bind serialized pooling metadata inputs — TRT alloc, binding, and shape setting.
test: cover serialized pooling metadata — the CPU-reference test.
style: apply pre-commit formatting — noise, skippable.

The key thing to verify is that the metadata producer (commits 1–2) matches the 7-input contract in the ONNX / Engine Interface table above, which is the same contract the exporter targets in tier4/AWML#206.

github-actions · 2026-06-08T09:22:30Z

Thank you for contributing to the Autoware project!

🚧 If your pull request is in progress, switch it to draft mode.

Please ensure:

You've checked our contribution guidelines.
Your PR follows our pull request guidelines.
All required CI checks pass before marking the PR ready for review.

codecov · 2026-06-08T10:29:33Z

Codecov Report

❌ Patch coverage is 0% with 178 lines in your changes missing coverage. Please review.
✅ Project coverage is 0.40%. Comparing base (80d5c9d) to head (8100ee3).
⚠️ Report is 2 commits behind head on main.

Files with missing lines	Patch %	Lines
perception/autoware_ptv3/lib/ptv3_trt.cpp	0.00%	99 Missing ⚠️
.../autoware_ptv3/lib/preprocess/preprocess_kernel.cu	0.00%	61 Missing ⚠️
...utoware_ptv3/include/autoware/ptv3/ptv3_config.hpp	0.00%	15 Missing ⚠️
perception/autoware_ptv3/src/ptv3_node.cpp	0.00%	3 Missing ⚠️

Additional details and impacted files

@@             Coverage Diff             @@
##             main   #12727       +/-   ##
===========================================
- Coverage   19.58%    0.40%   -19.19%     
===========================================
  Files        1903       12     -1891     
  Lines      131382      993   -130389     
  Branches    45956      165    -45791     
===========================================
- Hits        25733        4    -25729     
+ Misses      84684      988    -83696     
+ Partials    20965        1    -20964

Flag	Coverage Δ
full-suite	`0.40% <0.00%> (-19.19%)`	⬇️


Adapt the PTv3 ONNX export so each encoder SerializedPooling stage consumes precomputed pooling metadata as graph inputs instead of discovering it in-graph via Unique. This matches the updated inference node in autowarefoundation/autoware_universe#12727: the exported pooling subgraph is now native Gather + autoware::SegmentCSR with no data-dependent Unique.

- tools/export.py: build per-stage metadata sample tensors and register them as named dynamic ONNX inputs (serialized_pooling_{i}_{indices,indptr,cluster,head_indices,grid_coord,serialized_order,serialized_inverse}); attach each stage's metadata to its SerializedPooling module before tracing.
- SerializedPooling: in export_mode read metadata from the module attribute instead of computing it. Kept off the Point because addict recursively converts a metadata dict into a (coord-less) Point and recurses infinitely.
- Add a CPU export/train equivalence test and document the ONNX preprocessing contract in the README.

Verified by exporting a j6gen2 lidarseg checkpoint (O=2) in the awml-ptv3 docker: 4 pooling stages x 7 metadata inputs, 4x SegmentCSR, native Gather, no Unique; onnx.checker passes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Adapt the PTv3 ONNX export so each encoder SerializedPooling stage consumes precomputed pooling metadata as graph inputs instead of discovering it in-graph via Unique. This matches the updated inference node in autowarefoundation/autoware_universe#12727: the exported pooling subgraph is now native Gather + autoware::SegmentCSR with no data-dependent Unique.

- tools/export.py: build per-stage metadata sample tensors and register them as named dynamic ONNX inputs (serialized_pooling_{i}_{indices,indptr,cluster,head_indices,grid_coord,serialized_order,serialized_inverse}).
- SerializedPooling: in export_mode read its metadata from the Point. The metadata travels as a SerializedPoolingMeta dataclass (not a dict) so addict stores it verbatim and propagates it across stages -- a plain dict would be recursively converted into a coord-less Point and recurse infinitely.
- Add a CPU export/train equivalence test and document the ONNX preprocessing contract in the README.

Verified by exporting a j6gen2 lidarseg checkpoint (O=2) in the awml-ptv3 docker: 31 graph inputs (4 pooling stages x 7 metadata inputs), 4x SegmentCSR, native Gather, no Unique; onnx.checker passes; unit test passes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

This PR updates the autoware_ptv3 pipeline to precompute PTv3 serialized-pooling metadata during CUDA preprocessing and feed it into the TensorRT engine as additional dynamic inputs, enabling TensorRT to infer shapes up-front and avoid data-dependent shape ops inside the engine (reducing sync/allocation “trainStation” behavior).

Changes:

Add a CUDA preprocessing path to generate per-stage serialized-pooling metadata tensors (CSR/gather + downstream serialization inputs).
Update the TensorRT wrapper to allocate/bind serialized-pooling metadata buffers and set their dynamic input shapes per frame.
Plumb new configuration parameters (serialization_orders, pooling_strides) through schema/YAML/node and add a CPU-reference CUDA gtest validating the ONNX-facing metadata contract.

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 3 comments.

Show a summary per file

File	Description
perception/autoware_ptv3/lib/preprocess/preprocess_kernel.cu	Adds CUDA-side generation of per-stage serialized-pooling metadata using sort/scan to build CSR/gather structures and serialization orders.
perception/autoware_ptv3/include/autoware/ptv3/preprocess/preprocess_kernel.hpp	Exposes the serialized-pooling metadata generation API and stage buffer view struct.
perception/autoware_ptv3/lib/ptv3_trt.cpp	Allocates metadata buffers, binds them as TRT inputs, computes metadata each frame, and sets dynamic shapes before enqueue.
perception/autoware_ptv3/include/autoware/ptv3/ptv3_trt.hpp	Adds members and helper methods for serialized-pooling metadata buffer management and shape binding.
perception/autoware_ptv3/include/autoware/ptv3/ptv3_config.hpp	Adds config fields + validation for `serialization_orders` and `pooling_strides`.
perception/autoware_ptv3/src/ptv3_node.cpp	Declares new ROS parameters and forwards them into `PTv3Config`.
perception/autoware_ptv3/schema/ml_package_ptv3.schema.json	Extends ml-package schema with the new parameters.
perception/autoware_ptv3/config/ml_package_ptv3.param.yaml	Adds default values for the new parameters.
perception/autoware_ptv3/test/serialized_pooling_metadata_test.cpp	Adds a CUDA-backed gtest comparing GPU-generated metadata to a CPU reference implementation.
perception/autoware_ptv3/CMakeLists.txt	Registers and links the new gtest when testing is enabled (and CUDA/TRT are available).
perception/autoware_ptv3/package.xml	Adds the missing `ament_cmake_gtest` test dependency.


amadeuszsz · 2026-06-12T08:54:25Z

@codex review

chatgpt-codex-connector · 2026-06-12T08:57:28Z

Codex Review: Didn't find any major issues. Chef's kiss.

Reviewed commit: 7404ba4c56


Signed-off-by: Max SCHMELLER <max.schmeller@tier4.jp>

mojomex · 2026-06-12T10:30:33Z

No trainStations at 33f7272:

Note that this was profiled on a different GPU than the one in the PR description and doesn't include #12555 yet, so inference doesn't look as clean as above.

amadeuszsz

LGTM!

…on/autoware_universe#12727)

* perf: serialized pooling optimization

Signed-off-by: Max SCHMELLER <max.schmeller@tier4.jp>

* chore: fix rebase

Signed-off-by: Max SCHMELLER <max.schmeller@tier4.jp>

* style(pre-commit): autofix

---------

Signed-off-by: Max SCHMELLER <max.schmeller@tier4.jp>
Co-authored-by: pre-commit-ci-lite[bot] <117423508+pre-commit-ci-lite[bot]@users.noreply.github.com>

github-project-automation Bot added this to Software Working Group Jun 8, 2026

github-project-automation Bot moved this to To Triage in Software Working Group Jun 8, 2026

github-actions Bot added the component:perception Advanced sensor data processing and environment understanding. (auto-assigned) label Jun 8, 2026

mojomex mentioned this pull request Jun 8, 2026

feat(ptv3): precompute serialized pooling metadata mojomex/autoware_universe#9

Closed

mojomex self-assigned this Jun 8, 2026

mojomex added the run:build-and-test-differential Mark to enable build-and-test-differential workflow. (used-by-ci) label Jun 8, 2026

mojomex mentioned this pull request Jun 8, 2026

feat(ptv3)!: export precomputed serialized-pooling metadata for ONNX tier4/AWML#206

Merged

mojomex requested a review from Copilot June 9, 2026 07:35

Copilot started reviewing on behalf of mojomex June 9, 2026 07:35 View session

Copilot AI reviewed Jun 9, 2026


Comment thread perception/autoware_ptv3/lib/ptv3_trt.cpp

Comment thread perception/autoware_ptv3/schema/ml_package_ptv3.schema.json

Comment thread perception/autoware_ptv3/schema/ml_package_ptv3.schema.json

mojomex force-pushed the codex/ptv3-serialized-pooling-integration branch from 9e4ecbc to 7404ba4 Compare June 9, 2026 10:18

mojomex marked this pull request as ready for review June 9, 2026 12:20

mojomex requested review from amadeuszsz, knzo25 and manato as code owners June 9, 2026 12:20

mojomex requested review from KSeangTan and vividf June 9, 2026 12:21

mojomex changed the title ~~feat(ptv3): precompute serialized pooling metadata~~ feat(ptv3)!: precompute serialized pooling metadata Jun 9, 2026

mojomex requested a review from ktro2828 June 9, 2026 12:24

mojomex added the tag:require-cuda-build-and-test label Jun 9, 2026

mojomex mentioned this pull request Jun 10, 2026

feat(autoware_ptv3): split the backbone and head #12655

Merged

autowarefoundation deleted a comment from chatgpt-codex-connector Bot Jun 12, 2026

amadeuszsz mentioned this pull request Jun 12, 2026

chore(ansible): update ptv3 artifacts autowarefoundation/autoware#7160

Merged

amadeuszsz reviewed Jun 12, 2026


Comment thread perception/autoware_ptv3/include/autoware/ptv3/preprocess/preprocess_kernel.hpp Outdated

perf: serialized pooling optimization

0db4178

Signed-off-by: Max SCHMELLER <max.schmeller@tier4.jp>

mojomex force-pushed the codex/ptv3-serialized-pooling-integration branch from 7404ba4 to 0db4178 Compare June 12, 2026 09:56

chore: fix rebase

33f7272

Signed-off-by: Max SCHMELLER <max.schmeller@tier4.jp>

mojomex force-pushed the codex/ptv3-serialized-pooling-integration branch from 4fa0f63 to 33f7272 Compare June 12, 2026 10:23

mojomex requested a review from amadeuszsz June 12, 2026 10:25

style(pre-commit): autofix

8100ee3

amadeuszsz approved these changes Jun 12, 2026


amadeuszsz enabled auto-merge (squash) June 12, 2026 10:31

amadeuszsz merged commit 6439690 into autowarefoundation:main Jun 12, 2026
31 of 32 checks passed

github-project-automation Bot moved this from To Triage to Done in Software Working Group Jun 12, 2026

amadeuszsz merged 3 commits into autowarefoundation:main from mojomex:codex/ptv3-serialized-pooling-integration
