feat: Initial Arrow conversion and parquet output #5410

Merged
kodiakhq[bot] merged 49 commits into acts-project:main from paulgessinger:feat/arrow-plugin+conversion
Jun 5, 2026

@paulgessinger

@paulgessinger paulgessinger commented May 5, 2026

Member

Adds an Apache Arrow / Parquet output path for the examples framework, so event data (particles, sim hits, tracks) can be written to a columnar, sharded Parquet dataset and read back per-event — from both C++ and Python/pyarrow.

What

  • New ActsPluginArrow plugin wrapping Apache Arrow. Arrow is linked
    with hidden visibility so none of its symbols leak across the .so
    boundary (enforced by exported-symbol lists). Two visibility-exported,
    pybind-friendly handles — ArrowSchemaHandle and ArrowTable — move
    schemas and tables across library and language boundaries via the Arrow
    C Data Interface, giving ABI-safe pyarrow interop without exposing
    Arrow's own typeinfo.

  • Parquet I/O (Examples/Io/Parquet):

    • ParquetWriter: sharded dataset, one directory per collection. Events
      are routed to shard files by event_id, so each shard owns a disjoint
      event-id range and footer min/max statistics stay tight — letting the
      reader prune to a single fragment per lookup. Row-group buffering bounds
      peak memory; per-collection schemas are validated on write.
    • ParquetReader: dataset reader with per-event lookup via filter
      pushdown + footer-statistics pruning, and added-column schema evolution
      against an optional target schema.
  • Output converters (Examples/Io/Arrow) for particles, sim hits and
    tracks. Each builds a per-event nested table: one row per event, every
    field a list<T> whose single element holds that event's values, with
    event_id as the outer routing key.

  • Python bindings for the converters, reader and writer, with the Arrow
    schema bridged into Python so the producing and consuming sides share one
    schema handle.

  • Wired into full_chain_odd.py (new --output-parquet flag); covered by
    test_arrow.py and an ABI-isolation test (test_arrow_isolation.py).
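
The writer routing and reader pruning described above can be illustrated with a small pure-Python sketch. It does not use the ACTS or Arrow APIs: `Shard`, `ShardedCollection`, and the `events_per_shard` parameter are hypothetical names, and routing by integer division is only one assumed way to give each shard a disjoint event-id range. Rows follow the per-event nested layout (one row per event, every field a list).

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Shard:
    """One shard file: per-event rows plus footer-style min/max statistics."""
    rows: list = field(default_factory=list)
    min_event_id: Optional[int] = None
    max_event_id: Optional[int] = None

    def append(self, row):
        eid = row["event_id"]
        self.rows.append(row)
        # Keep the statistics tight as rows arrive.
        self.min_event_id = eid if self.min_event_id is None else min(self.min_event_id, eid)
        self.max_event_id = eid if self.max_event_id is None else max(self.max_event_id, eid)


class ShardedCollection:
    """Hypothetical stand-in for one collection directory of shard files."""

    def __init__(self, events_per_shard):
        self.events_per_shard = events_per_shard  # assumed routing parameter
        self.shards = {}  # shard index -> Shard

    def write(self, event_id, columns):
        # One row per event: every field is a list holding that event's values.
        row = {"event_id": event_id, **{k: list(v) for k, v in columns.items()}}
        index = event_id // self.events_per_shard  # disjoint id range per shard
        self.shards.setdefault(index, Shard()).append(row)

    def candidates(self, event_id):
        # Statistics pruning: keep only shards whose [min, max] contains the id.
        return [i for i, s in self.shards.items()
                if s.min_event_id <= event_id <= s.max_event_id]

    def read(self, event_id):
        # Only the surviving shards are scanned for the matching row.
        for i in self.candidates(event_id):
            for row in self.shards[i].rows:
                if row["event_id"] == event_id:
                    return row
        return None


particles = ShardedCollection(events_per_shard=100)
for eid in (250, 3, 101, 42, 199):  # arrival order need not be sorted
    particles.write(eid, {"pt": [1.0 * eid, 2.0 * eid], "pdg": [13, -13]})
```

Because shard ranges are disjoint, `particles.candidates(42)` yields a single shard, which mirrors the "single fragment per lookup" property the description claims for the real reader.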

Why

ACTS examples lacked a standard columnar output. Parquet gives compact, typed, schema-stable files that downstream ML/analysis tooling consumes directly, and the C Data Interface design lets the same buffers be used from pyarrow with no copies and no Arrow symbol clashes.
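
As a rough illustration of the added-column schema evolution mentioned for the reader, the sketch below projects rows from an older file onto a newer target schema. It is plain Python, not the ParquetReader implementation: `evolve_rows` is a hypothetical helper, and both null-filling of missing columns and rejection of unknown columns are assumptions about the intended behavior.

```python
def evolve_rows(rows, target_schema):
    """Project rows onto target_schema, null-filling columns the file lacks.

    target_schema maps column name -> type string; only its keys and their
    order are used in this sketch.
    """
    evolved = []
    for row in rows:
        unknown = set(row) - set(target_schema)
        if unknown:
            # Assumption: only *added* columns are supported, so a column
            # missing from the target schema is treated as an error.
            raise ValueError(f"columns not in target schema: {sorted(unknown)}")
        # Columns absent from the older file come back as None (null).
        evolved.append({name: row.get(name) for name in target_schema})
    return evolved


# A file written before the "charge" column was added to the schema.
old_file_rows = [{"event_id": 7, "pt": [1.5, 2.5]}]
target = {"event_id": "uint64", "pt": "list<float>", "charge": "list<int8>"}
evolved = evolve_rows(old_file_rows, target)
```

Older files then remain readable alongside newer ones, with consumers seeing one consistent set of columns.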

Blocked by:

@github-actions github-actions Bot added this to the next milestone May 5, 2026
@github-actions github-actions Bot added the Component - Core, Component - Fatras, Component - Examples, Component - Plugins, and Event Data Model labels May 5, 2026
@github-actions

github-actions Bot commented May 5, 2026

Contributor

📊: Physics performance monitoring for b74b718

Full contents

physmon summary

@paulgessinger paulgessinger force-pushed the feat/arrow-plugin+conversion branch from db8a828 to 1cbaaba Compare May 11, 2026 15:16

@benjaminhuth benjaminhuth left a comment

Member

I have some comments, but I have not fully finished the review. One of the main points is whether we need to bundle the Arrow interface and the Calo EDM together in one PR.

Comment thread Examples/Framework/include/ActsExamples/EventData/CaloHit.hpp Outdated
Comment thread Examples/Framework/include/ActsExamples/EventData/CaloHit.hpp Outdated
Comment thread Examples/Framework/include/ActsExamples/EventData/CaloHit.hpp Outdated
Comment thread plans/edm4hep_sim_input_converter_perf.md Outdated
Comment thread Plugins/Arrow/src/ArrowUtil.cpp Outdated
Comment thread Plugins/Arrow/include/ActsPlugins/Arrow/ArrowUtil.hpp
Comment thread Examples/Io/Parquet/src/ParquetWriter.cpp Outdated
Comment thread Examples/Io/Parquet/src/ParquetWriter.cpp Outdated
@github-actions github-actions Bot added the Component - Documentation label May 12, 2026
Comment thread Plugins/Arrow/src/ArrowUtil.cpp
@github-actions github-actions Bot added the Component - Detray label May 12, 2026
Comment thread Detray/tests/unit_tests/cpu/navigation/intersection/helix_intersector.cpp Outdated
Comment thread Examples/Io/Arrow/src/ArrowSimHitOutputConverter.cpp
@paulgessinger paulgessinger force-pushed the feat/arrow-plugin+conversion branch from a09f868 to eccb75c Compare May 12, 2026 15:03
@paulgessinger paulgessinger added the 🛑 blocked label May 12, 2026
@paulgessinger paulgessinger force-pushed the feat/arrow-plugin+conversion branch from eccb75c to d890ddd Compare May 12, 2026 16:29
@github-actions github-actions Bot removed the Component - Detray label May 12, 2026
@benjaminhuth

benjaminhuth commented May 13, 2026

Member

Hmm maybe we need to revise the flexibility of the dataset reading: a download of the ColliderML dataset with the colliderml library looks like this:

.
└── CERN__ColliderML-Release-1
    ├── ttbar_pu200_particles
    │   ├── data
    │   │   └── ttbar_pu200_particles
    │   │       └── train-00000-of-01000.parquet
    │   └── metadata.json
    └── ttbar_pu200_tracker_hits
        ├── data
        │   └── ttbar_pu200_tracker_hits
        │       └── train-00000-of-01000.parquet
        └── metadata.json

I think we cannot map this at the moment...

@murnanedaniel

Contributor

@benjaminhuth @paulgessinger

> Hmm maybe we need to revise the flexibility of the dataset reading: a download of the ColliderML dataset with the colliderml library looks like this:
>
> .
> └── CERN__ColliderML-Release-1
>     ├── ttbar_pu200_particles
>     │   ├── data
>     │   │   └── ttbar_pu200_particles
>     │   │       └── train-00000-of-01000.parquet
>     │   └── metadata.json
>     └── ttbar_pu200_tracker_hits
>         ├── data
>         │   └── ttbar_pu200_tracker_hits
>         │       └── train-00000-of-01000.parquet
>         └── metadata.json
>
> I think we cannot map this at the moment...

Is this a blocker though? One can output to a location then re-arrange as needed. Or do you specifically mean allowing different shard sizes?

Comment thread Examples/Framework/src/Utilities/PerigeeParameters.cpp Outdated
Comment thread Examples/Framework/src/Utilities/PerigeeParameters.cpp Outdated
Comment thread Examples/Framework/src/Utilities/PerigeeParameters.cpp Outdated
@andiwand

andiwand commented Jun 2, 2026

Contributor

This adds quite some complexity; I remember some of it from when you explained it to me. Can we summarize this in the PR description? An LLM summary would be enough from my side, and it could also guide the review a bit.

Comment thread Examples/Framework/include/ActsExamples/Utilities/PerigeeParameters.hpp Outdated
@benjaminhuth

Member

> What flexibility is missing @benjaminhuth ?

For the record: I think I was not correct, I don't really remember what exact issue I was encountering...

benjaminhuth previously approved these changes Jun 3, 2026

@benjaminhuth benjaminhuth left a comment

Member

LGTM!

@paulgessinger

Member Author

@andiwand added a PR description now.

Comment thread Examples/Scripts/Python/full_chain_odd.py
@kodiakhq kodiakhq Bot merged commit 2aa0947 into acts-project:main Jun 5, 2026
45 checks passed
@github-actions github-actions Bot removed the automerge label Jun 5, 2026
benjaminhuth added a commit to benjaminhuth/acts that referenced this pull request Jun 5, 2026
Integrates the official Arrow/Parquet base (PR acts-project#5410) from upstream/main.
Our branch retains only the ColliderML reader on top:
- ArrowUtil: keep flatColumnUInt*/readFlatParquetFile (used by ColliderML)
- Parquet CMakeLists: keep ColliderMLInputConverter target block
- Python Arrow bindings: keep ColliderMLInputConverter pybind11 declarations
- root_file_hashes.txt: keep ColliderML test hashes + upstream strip_space_points rename

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@andiwand andiwand modified the milestones: next, v46.8.0 Jun 11, 2026

Labels

Component - Documentation, Component - Examples, Component - Plugins

4 participants