feat: Initial Arrow conversion and parquet output #5410
benjaminhuth
left a comment
I have some comments, but I have not fully finished the review. One of the main questions is whether we need to bundle the Arrow interface and the Calo EDM together in one PR.
Hmm, maybe we need to revisit the flexibility of the dataset reading: a download of the ColliderML dataset with the colliderml library looks like this: I think we cannot map this at the moment...
Is this a blocker though? One can output to a location then re-arrange as needed. Or do you specifically mean allowing different shard sizes?
This adds quite a bit of complexity; I remember some of it from when you explained it to me. Can we summarize this in the PR description? An LLM summary would be enough from my side, and it could also guide the review a bit.
For the record: I think I was not correct; I don't really remember exactly what issue I was encountering...
@andiwand added a PR description now.
Integrates the official Arrow/Parquet base (PR acts-project#5410) from upstream/main. Our branch retains only the ColliderML reader on top:
- ArrowUtil: keep flatColumnUInt*/readFlatParquetFile (used by ColliderML)
- Parquet CMakeLists: keep ColliderMLInputConverter target block
- Python Arrow bindings: keep ColliderMLInputConverter pybind11 declarations
- root_file_hashes.txt: keep ColliderML test hashes + upstream strip_space_points rename

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds an Apache Arrow / Parquet output path for the examples framework, so event data (particles, sim hits, tracks) can be written to a columnar, sharded Parquet dataset and read back per-event — from both C++ and Python/pyarrow.
What
New ActsPluginArrow plugin wrapping Apache Arrow. Arrow is linked with hidden visibility so none of its symbols leak across the .so boundary (enforced by exported-symbol lists). Two visibility-exported, pybind-friendly handles, ArrowSchemaHandle and ArrowTable, move schemas and tables across library and language boundaries via the Arrow C Data Interface, giving ABI-safe pyarrow interop without exposing Arrow's own typeinfo.
Parquet I/O (Examples/Io/Parquet):
- ParquetWriter: sharded dataset, one directory per collection. Events are routed to shard files by event_id, so each shard owns a disjoint event-id range and footer min/max statistics stay tight, letting the reader prune to a single fragment per lookup. Row-group buffering bounds peak memory; per-collection schemas are validated on write.
- ParquetReader: dataset reader with per-event lookup via filter pushdown + footer-statistics pruning, and added-column schema evolution against an optional target schema.
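To make the routing and pruning argument concrete, here is a minimal stdlib-only sketch. It is not the ParquetWriter/ParquetReader code from this PR: the shard width `EVENTS_PER_SHARD`, the `Shard` class and all function names are hypothetical, and the actual routing rule may differ. It only illustrates why routing by event_id keeps per-shard min/max ranges disjoint, so a lookup touches a single shard.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical shard width; the PR does not state the actual value.
EVENTS_PER_SHARD = 100


def shard_index(event_id: int, events_per_shard: int = EVENTS_PER_SHARD) -> int:
    """Map an event id to a shard so each shard owns a contiguous id range."""
    return event_id // events_per_shard


@dataclass
class Shard:
    """Stand-in for one Parquet file; min_id/max_id mimic footer statistics."""
    rows: list = field(default_factory=list)
    min_id: Optional[int] = None
    max_id: Optional[int] = None

    def append(self, event_id: int, payload: dict) -> None:
        self.rows.append((event_id, payload))
        self.min_id = event_id if self.min_id is None else min(self.min_id, event_id)
        self.max_id = event_id if self.max_id is None else max(self.max_id, event_id)


def write_events(events):
    """Route (event_id, payload) pairs to shards, in any arrival order."""
    shards = {}
    for event_id, payload in events:
        shards.setdefault(shard_index(event_id), Shard()).append(event_id, payload)
    return shards


def candidate_shards(shards, event_id):
    """Prune: keep only shards whose [min_id, max_id] range can hold event_id."""
    return [idx for idx, s in shards.items() if s.min_id <= event_id <= s.max_id]


def read_event(shards, event_id):
    """Scan only the surviving shards for the requested event."""
    for idx in candidate_shards(shards, event_id):
        for eid, payload in shards[idx].rows:
            if eid == event_id:
                return payload
    return None
```

Because the ranges never overlap, `candidate_shards` returns at most one shard per lookup. Had events been distributed round-robin instead, every shard's range would span nearly all ids and the statistics could not rule any shard out.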
Output converters (Examples/Io/Arrow) for particles, sim hits and tracks. Each builds a per-event nested table: one row per event, every field a list<T> whose single element holds that event's values, with event_id as the outer routing key.
Python bindings for the converters, reader and writer, with the Arrow schema bridged into Python so the producing and consuming sides share one schema handle.
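The per-event nested layout used by the output converters can be pictured with plain Python containers. This is only an illustrative sketch: the column names (`particle_id`, `px`, `py`, `pz`) and both helper functions are hypothetical and not taken from the PR, and real tables would be Arrow list<T> columns rather than dicts of lists.

```python
# Hypothetical column names, chosen for illustration only.
FIELDS = ("particle_id", "px", "py", "pz")


def to_event_row(event_id, particles):
    """Pack one event into a single row: a scalar event_id plus one list per field."""
    row = {"event_id": event_id}
    for name in FIELDS:
        row[name] = [p[name] for p in particles]
    return row


def from_event_row(row):
    """Unpack a row back into per-particle records, checking list lengths agree."""
    names = [k for k in row if k != "event_id"]
    lengths = {len(row[k]) for k in names}
    if len(lengths) > 1:
        raise ValueError("all list columns of one event must have equal length")
    n = lengths.pop() if lengths else 0
    return [{k: row[k][i] for k in names} for i in range(n)]


particles = [
    {"particle_id": 1, "px": 0.1, "py": 0.2, "pz": 1.0},
    {"particle_id": 2, "px": -0.3, "py": 0.0, "pz": 2.5},
]
row = to_event_row(42, particles)
```

Here `row["event_id"]` stays a scalar, which is what makes it usable as a routing and filter key, while every other column carries the whole event's values in one list cell.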
Wired into full_chain_odd.py (new --output-parquet flag); covered by test_arrow.py and an ABI-isolation test (test_arrow_isolation.py).
Why
ACTS examples lacked a standard columnar output. Parquet gives compact, typed, schema-stable files that downstream ML/analysis tooling consumes directly, and the C-Data-interface design lets the same buffers be used from pyarrow with no copies and no arrow-symbol clashes.
Blocked by: