feat: Initial Arrow conversion and parquet output by paulgessinger · Pull Request #5410 · acts-project/acts

paulgessinger · 2026-05-05T16:48:56Z

Adds an Apache Arrow / Parquet output path for the examples framework, so event data (particles, sim hits, tracks) can be written to a columnar, sharded Parquet dataset and read back per-event — from both C++ and Python/pyarrow.

What

- New ActsPluginArrow plugin wrapping Apache Arrow. Arrow is linked
  with hidden visibility so none of its symbols leak across the .so
  boundary (enforced by exported-symbol lists). Two visibility-exported,
  pybind-friendly handles, ArrowSchemaHandle and ArrowTable, move
  schemas and tables across library and language boundaries via the Arrow
  C Data Interface, giving ABI-safe pyarrow interop without exposing
  Arrow's own typeinfo.
- Parquet I/O (Examples/Io/Parquet):
  - ParquetWriter: sharded dataset, one directory per collection. Events
    are routed to shard files by event_id, so each shard owns a disjoint
    event-id range and footer min/max statistics stay tight, letting the
    reader prune to a single fragment per lookup. Row-group buffering bounds
    peak memory; per-collection schemas are validated on write.
  - ParquetReader: dataset reader with per-event lookup via filter
    pushdown + footer-statistics pruning, and added-column schema evolution
    against an optional target schema.
- Output converters (Examples/Io/Arrow) for particles, sim hits and
  tracks. Each builds a per-event nested table: one row per event, every
  field a list<T> whose single element holds that event's values, with
  event_id as the outer routing key.
- Python bindings for the converters, reader and writer, with the Arrow
  schema bridged into Python so the producing and consuming sides share one
  schema handle.
- Wired into full_chain_odd.py (new --output-parquet flag); covered by
  test_arrow.py and an ABI-isolation test (test_arrow_isolation.py).
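
To make the routing and pruning idea above concrete, here is a minimal pure-Python model. It is not the ParquetWriter/ParquetReader implementation and uses no Arrow API. The `Shard` and `ShardedCollection` classes, the `events_per_shard` parameter and the `particle_id` field are hypothetical names chosen for illustration, and the integer-division routing rule is an assumption. Each row stands for one event, with list-valued fields holding that event's values.

```python
from dataclasses import dataclass, field


@dataclass
class Shard:
    """One shard file, modelled as a list of per-event rows."""
    rows: list = field(default_factory=list)

    @property
    def stats(self):
        # Stand-in for the Parquet footer min/max statistics on event_id.
        ids = [row["event_id"] for row in self.rows]
        return (min(ids), max(ids)) if ids else None


class ShardedCollection:
    """Routes events to shards by event_id so shard ranges never overlap."""

    def __init__(self, events_per_shard):
        self.events_per_shard = events_per_shard  # hypothetical parameter
        self.shards = {}

    def write(self, row):
        # Assumed routing rule: integer division gives disjoint id ranges.
        index = row["event_id"] // self.events_per_shard
        self.shards.setdefault(index, Shard()).rows.append(row)

    def read_event(self, event_id):
        """Return (row, number_of_shards_scanned) for one event."""
        scanned = 0
        for shard in self.shards.values():
            lo, hi = shard.stats
            if not lo <= event_id <= hi:
                continue  # pruned from statistics alone, rows never touched
            scanned += 1
            for row in shard.rows:
                if row["event_id"] == event_id:
                    return row, scanned
        return None, scanned


collection = ShardedCollection(events_per_shard=100)
for event_id in (250, 3, 101, 199, 42):  # arrival order need not be sorted
    # One row per event; the list field holds that event's values.
    collection.write(
        {"event_id": event_id, "particle_id": [event_id * 10 + i for i in range(3)]}
    )

row, scanned = collection.read_event(199)
print(row["particle_id"], scanned)
```

Because the shard ranges are disjoint, at most one shard can pass the min/max check for any given event_id, which is what keeps a per-event lookup down to a single fragment.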

Why

ACTS examples lacked a standard columnar output. Parquet gives compact, typed, schema-stable files that downstream ML/analysis tooling consumes directly, and the C-Data-interface design lets the same buffers be used from pyarrow with no copies and no arrow-symbol clashes.

Blocked by:

github-actions · 2026-05-05T18:35:53Z

📊: Physics performance monitoring for `b74b718`

Full contents

physmon summary

benjaminhuth

I have some comments, but I have not fully finished the review. One of the main questions is whether we need to bundle the Arrow interface and the Calo EDM together in one PR.

benjaminhuth · 2026-05-13T13:12:55Z

Hmm maybe we need to revise the flexibility of the dataset reading: a download of the ColliderML dataset with the colliderml library looks like this:

.
└── CERN__ColliderML-Release-1
    ├── ttbar_pu200_particles
    │   ├── data
    │   │   └── ttbar_pu200_particles
    │   │       └── train-00000-of-01000.parquet
    │   └── metadata.json
    └── ttbar_pu200_tracker_hits
        ├── data
        │   └── ttbar_pu200_tracker_hits
        │       └── train-00000-of-01000.parquet
        └── metadata.json

I think we cannot map this at the moment...

murnanedaniel · 2026-06-01T07:50:33Z

@benjaminhuth @paulgessinger

> Hmm maybe we need to revise the flexibility of the dataset reading: a download of the ColliderML dataset with the colliderml library looks like this:
>
> .
> └── CERN__ColliderML-Release-1
>     ├── ttbar_pu200_particles
>     │   ├── data
>     │   │   └── ttbar_pu200_particles
>     │   │       └── train-00000-of-01000.parquet
>     │   └── metadata.json
>     └── ttbar_pu200_tracker_hits
>         ├── data
>         │   └── ttbar_pu200_tracker_hits
>         │       └── train-00000-of-01000.parquet
>         └── metadata.json
>
> I think we cannot map this at the moment...

Is this a blocker though? One can output to a location then re-arrange as needed. Or do you specifically mean allowing different shard sizes?
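
As a rough illustration of the "output, then re-arrange" option, the sketch below copies the shard files of one collection directory into the nested layout shown in the tree above. The `rearrange` helper, the source file names and the `release` directory are hypothetical. The shard naming follows the `train-XXXXX-of-YYYYY.parquet` pattern from the listing, with the second number assumed to be the total shard count. `metadata.json` is not generated, since its contents are not described here.

```python
import shutil
import tempfile
from pathlib import Path


def rearrange(collection_dir, dest_root, name):
    """Copy every shard in collection_dir into dest_root/name/data/name/."""
    shards = sorted(Path(collection_dir).glob("*.parquet"))
    target = Path(dest_root) / name / "data" / name
    target.mkdir(parents=True, exist_ok=True)
    copied = []
    for index, shard in enumerate(shards):
        # Assumed naming: zero-padded shard index and total shard count.
        new_name = f"train-{index:05d}-of-{len(shards):05d}.parquet"
        shutil.copy2(shard, target / new_name)
        copied.append(new_name)
    return copied


with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "particles"  # hypothetical per-collection directory
    source.mkdir()
    for shard_name in ("shard_b.parquet", "shard_a.parquet"):
        (source / shard_name).write_bytes(b"")  # empty placeholder, not real Parquet
    result = rearrange(source, Path(tmp) / "release", "ttbar_pu200_particles")
    layout = sorted(
        p.relative_to(tmp).as_posix()
        for p in (Path(tmp) / "release").rglob("*.parquet")
    )

print(result)
print(layout)
```

This only renames and moves files. It does not change shard sizes, so it would not help if the concern is differing numbers of events per shard.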

andiwand · 2026-06-02T14:52:44Z

This adds quite a bit of complexity; I remember some of it from when you explained it to me. Can we summarize this in the PR description? An LLM summary would be enough from my side, and it could also guide the review a bit.

benjaminhuth · 2026-06-02T16:02:42Z

> What flexibility is missing @benjaminhuth?

For the record: I think I was not correct, I don't really remember what exact issue I was encountering...

benjaminhuth

LGTM!

paulgessinger · 2026-06-03T13:03:13Z

@andiwand added a PR description now.

Integrates the official Arrow/Parquet base (PR acts-project#5410) from upstream/main. Our branch retains only the ColliderML reader on top:
- ArrowUtil: keep flatColumnUInt*/readFlatParquetFile (used by ColliderML)
- Parquet CMakeLists: keep ColliderMLInputConverter target block
- Python Arrow bindings: keep ColliderMLInputConverter pybind11 declarations
- root_file_hashes.txt: keep ColliderML test hashes + upstream strip_space_points rename

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

github-actions Bot added this to the next milestone May 5, 2026

github-actions Bot added the Component - Core, Component - Fatras, Component - Examples, Component - Plugins, and Event Data Model labels May 5, 2026

paulgessinger force-pushed the feat/arrow-plugin+conversion branch from db8a828 to 1cbaaba Compare May 11, 2026 15:16

benjaminhuth reviewed May 12, 2026

github-actions Bot added the Component - Documentation label May 12, 2026

paulgessinger commented May 12, 2026

Comment thread Plugins/Arrow/src/ArrowUtil.cpp

github-actions Bot added the Component - Detray label May 12, 2026

paulgessinger commented May 12, 2026

Comment thread Detray/tests/unit_tests/cpu/navigation/intersection/helix_intersector.cpp Outdated

paulgessinger commented May 12, 2026

Comment thread Examples/Io/Arrow/src/ArrowSimHitOutputConverter.cpp

paulgessinger force-pushed the feat/arrow-plugin+conversion branch from a09f868 to eccb75c Compare May 12, 2026 15:03

paulgessinger added the 🛑 blocked label May 12, 2026

paulgessinger mentioned this pull request May 12, 2026

feat: Add calo hit reading from edm4hep and writing to parquet #5441

Open

paulgessinger force-pushed the feat/arrow-plugin+conversion branch from eccb75c to d890ddd Compare May 12, 2026 16:29

github-actions Bot removed the Component - Detray label May 12, 2026

This was referenced Jun 1, 2026

ACTS-native parquet output + validation vs convert_all (draft) OpenDataDetector/ColliderML-Production#41

Merged

acts: build the Arrow/Parquet plugin (native parquet output) OpenDataDetector/sw#3

Open

paulgessinger removed the 🛑 blocked label Jun 2, 2026

paulgessinger added 6 commits June 2, 2026 09:32

feat: Initial arrow/parquet support

4720f3d

experiment with arrow object library

3a09ce1

clean up symbol visibility in wrapper target

ebdde33

make the isolated arrow absorption optional

230727b

add parquet option to full chain odd

080bf22

updated particle arrow schema based on colliderml

2159d6f

paulgessinger added 7 commits June 2, 2026 15:46

add a common "propagate-to-perigee" helper

c08df2f

drop EDM4hep changes

ddabeed

remove --hepmc3 changes

f0ca765

use digitized clusters to look up the global positions

a3670ee

add test coverage for the sim output conversion

6c64e52

m_states -> m_collectionStates

b210175

cleanup

ec13760

andiwand reviewed Jun 2, 2026

Comment thread Examples/Framework/src/Utilities/PerigeeParameters.cpp Outdated

Comment thread Examples/Framework/src/Utilities/PerigeeParameters.cpp Outdated

Comment thread Examples/Framework/src/Utilities/PerigeeParameters.cpp Outdated

benjaminhuth reviewed Jun 2, 2026

Comment thread Examples/Framework/include/ActsExamples/Utilities/PerigeeParameters.hpp Outdated

paulgessinger added 3 commits June 3, 2026 11:39

remove the common extrapolation

0f6208a

fix missing else

a6f6b87

format

2171c03

benjaminhuth previously approved these changes Jun 3, 2026

benjaminhuth reviewed Jun 3, 2026

Comment thread Examples/Scripts/Python/full_chain_odd.py

enable --output-root in ML ambi test

f2fcbc7

paulgessinger dismissed benjaminhuth’s stale review via f2fcbc7 June 3, 2026 13:42

benjaminhuth approved these changes Jun 3, 2026

benjaminhuth mentioned this pull request Jun 4, 2026

feat: ColliderML reader #5546

Merged

benjaminhuth added the automerge label Jun 4, 2026

Merge branch 'main' into feat/arrow-plugin+conversion

b74b718

kodiakhq Bot merged commit 2aa0947 into acts-project:main Jun 5, 2026
45 checks passed

github-actions Bot removed the automerge label Jun 5, 2026

andiwand modified the milestones: next, v46.8.0 Jun 11, 2026

This was referenced Jun 12, 2026

feat: one row per measurement in Arrow tracker-hit output #5577

Closed

feat: measurement and sim-hit tables for the Arrow/Parquet output #5586

Draft

4 participants
