Commit 2aa0947

feat: Initial Arrow conversion and parquet output (#5410)
Adds an Apache Arrow / Parquet output path for the examples framework, so event data (particles, sim hits, tracks) can be written to a columnar, sharded Parquet dataset and read back per-event, from both C++ and Python/pyarrow.

### What

- **New `ActsPluginArrow` plugin** wrapping Apache Arrow. Arrow is linked with hidden visibility so none of its symbols leak across the `.so` boundary (enforced by exported-symbol lists). Two visibility-exported, pybind-friendly handles, `ArrowSchemaHandle` and `ArrowTable`, move schemas and tables across library and language boundaries via the Arrow **C Data Interface**, giving ABI-safe pyarrow interop without exposing arrow's own typeinfo.
- **Parquet I/O (`Examples/Io/Parquet`)**:
  - `ParquetWriter`: sharded dataset, one directory per collection. Events are routed to shard files by `event_id`, so each shard owns a disjoint event-id range and footer min/max statistics stay tight, letting the reader prune to a single fragment per lookup. Row-group buffering bounds peak memory; per-collection schemas are validated on write.
  - `ParquetReader`: dataset reader with per-event lookup via filter pushdown + footer-statistics pruning, and added-column schema evolution against an optional target schema.
- **Output converters (`Examples/Io/Arrow`)** for particles, sim hits and tracks. Each builds a per-event **nested** table: one row per event, every field a `list<T>` whose single element holds that event's values, with `event_id` as the outer routing key.
- **Python bindings** for the converters, reader and writer, with the arrow schema bridged into Python so the producing and consuming sides share one schema handle.
- Wired into `full_chain_odd.py` (new `--output-parquet` flag); covered by `test_arrow.py` and an ABI-isolation test (`test_arrow_isolation.py`).

### Why

ACTS examples lacked a standard columnar output. Parquet gives compact, typed, schema-stable files that downstream ML/analysis tooling consumes directly, and the C-Data-interface design lets the same buffers be used from pyarrow with no copies and no arrow-symbol clashes.

Blocked by:

- #5439
- #5440
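The event-id shard routing that the `ParquetWriter` description mentions can be illustrated with a minimal sketch. It assumes fixed-width, contiguous event-id ranges per shard; `shardIndexFor`, `shardFileName`, the `eventsPerShard` parameter and the file-naming scheme are all hypothetical and are not taken from the actual writer.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper (not the ParquetWriter API): map an event id to a shard
// index so that every shard owns a disjoint, contiguous event-id range.
std::uint64_t shardIndexFor(std::uint64_t eventId,
                            std::uint64_t eventsPerShard) {
  assert(eventsPerShard > 0 && "shard width must be positive");
  return eventId / eventsPerShard;
}

// Hypothetical naming scheme: one directory per collection, one file per
// shard. The real on-disk layout may differ.
std::string shardFileName(const std::string& collection, std::uint64_t shard) {
  return collection + "/shard-" + std::to_string(shard) + ".parquet";
}
```

With disjoint ranges like these, the min/max `event_id` statistics of each file do not overlap, which is what allows a reader to discard every shard but one for a single-event lookup.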
1 parent 6bb8dce commit 2aa0947

35 files changed

Lines changed: 4247 additions & 6 deletions

CMakeLists.txt

Lines changed: 19 additions & 0 deletions
@@ -76,6 +76,15 @@ option(ACTS_USE_SYSTEM_GBL "Use a system-provided General Broken Lines (GBL) fro
 option(ACTS_USE_SYSTEM_MILLE "Use a system-provided Mille" ON)
 
 option(ACTS_BUILD_PLUGIN_ACTSVG "Build SVG display plugin" OFF)
+option(ACTS_BUILD_PLUGIN_ARROW "Build Apache Arrow/Parquet plugin" OFF)
+option(ACTS_ARROW_ISOLATED
+"Statically link arrow/parquet into libActsPluginArrow and hide their \
+symbols, so they cannot collide with other arrow consumers in the same \
+process (e.g. pyarrow's bundled libarrow). Disable only if your arrow \
+install is ABI-compatible with every other arrow consumer that will be \
+loaded alongside ACTS."
+ON
+)
 option(ACTS_BUILD_PLUGIN_DD4HEP "Build DD4hep plugin" OFF)
 option(ACTS_BUILD_PLUGIN_EDM4HEP "Build EDM4hep plugin" OFF)
 option(ACTS_BUILD_PLUGIN_FPEMON "Build FPE monitoring plugin" OFF)
@@ -124,6 +133,7 @@ option(ACTS_BUILD_EXAMPLES_FASTJET "Build FastJet plugin" OFF)
 option(ACTS_BUILD_EXAMPLES_GEANT4 "Build Geant4-based code in the examples" OFF)
 option(ACTS_BUILD_EXAMPLES_GNN "Build the GNN example code" OFF)
 option(ACTS_BUILD_EXAMPLES_HASHING "Build Hashing-based code in the examples" OFF)
+option(ACTS_BUILD_EXAMPLES_PARQUET "Build Arrow/Parquet-based code in the examples" OFF)
 option(ACTS_BUILD_EXAMPLES_PODIO "Build Podio-based code in the examples" OFF)
 option(ACTS_BUILD_EXAMPLES_PYTHIA8 "Build Pythia8-based code in the examples" OFF)
 option(ACTS_BUILD_EXAMPLES_PYTHON_BINDINGS "[Deprecated] Build python bindings and enables the examples" OFF)
@@ -182,6 +192,8 @@ set_option_if(
     OR
     ACTS_BUILD_EXAMPLES_HASHING
     OR
+    ACTS_BUILD_EXAMPLES_PARQUET
+    OR
     ACTS_BUILD_EXAMPLES_PODIO
     OR
     ACTS_BUILD_EXAMPLES_PYTHIA8
@@ -200,6 +212,7 @@ set_option_if(
 )
 set_option_if(ACTS_BUILD_PLUGIN_EDM4HEP ACTS_BUILD_EXAMPLES_EDM4HEP)
 set_option_if(ACTS_BUILD_EXAMPLES_PODIO ACTS_BUILD_EXAMPLES_EDM4HEP)
+set_option_if(ACTS_BUILD_PLUGIN_ARROW ACTS_BUILD_EXAMPLES_PARQUET)
 set_option_if(ACTS_BUILD_PLUGIN_GEANT4 ACTS_BUILD_EXAMPLES_GEANT4)
 set_option_if(
     ACTS_BUILD_PLUGIN_ROOT
@@ -336,6 +349,7 @@ set(_acts_covfie_version 0.15.2)
 set(_acts_vecmem_version 1.24.0)
 set(_acts_annoy_version 1.17.3)
 set(_acts_fastjet_version 3.4.1)
+set(_acts_arrow_version 23.0.0)
 
 # Help with compiler flags discovery
 include(ActsFunctions)
@@ -582,6 +596,11 @@ endif()
 if(ACTS_BUILD_PLUGIN_GEANT4)
     find_package(Geant4 ${_acts_geant4_version} REQUIRED CONFIG COMPONENTS gdml)
 endif()
+if(ACTS_BUILD_PLUGIN_ARROW)
+    find_package(Arrow ${_acts_arrow_version} REQUIRED CONFIG)
+    find_package(Parquet ${_acts_arrow_version} REQUIRED CONFIG)
+    find_package(ArrowDataset ${_acts_arrow_version} REQUIRED CONFIG)
+endif()
 
 if(ACTS_SETUP_VECMEM)
     if(ACTS_USE_SYSTEM_VECMEM)

Examples/Io/Arrow/CMakeLists.txt

Lines changed: 51 additions & 0 deletions
@@ -0,0 +1,51 @@
+add_library(
+    ActsExamplesIoArrow_obj
+    OBJECT
+    src/ArrowParticleOutputConverter.cpp
+    src/ArrowSimHitOutputConverter.cpp
+    src/ArrowTrackOutputConverter.cpp
+)
+
+target_include_directories(
+    ActsExamplesIoArrow_obj
+    PRIVATE
+        $<BUILD_INTERFACE:${CMAKE_CURRENT_SOURCE_DIR}/include>
+        # Genex (not target_link_libraries) for Plugin and sibling OBJECT
+        # donor: a link-level dep on ActsPluginArrow would create a cycle
+        # since it absorbs this OBJECT lib's .o files.
+        $<TARGET_PROPERTY:Acts::ExamplesIoParquet,INTERFACE_INCLUDE_DIRECTORIES>
+        $<TARGET_PROPERTY:Acts::PluginArrow,INTERFACE_INCLUDE_DIRECTORIES>
+)
+
+target_link_libraries(
+    ActsExamplesIoArrow_obj
+    PRIVATE Acts::ExamplesFramework Acts::ArrowLinkage Acts::ParquetLinkage
+)
+
+set_target_properties(
+    ActsExamplesIoArrow_obj
+    PROPERTIES POSITION_INDEPENDENT_CODE ON
+)
+if(ACTS_ARROW_ISOLATED)
+    set_target_properties(
+        ActsExamplesIoArrow_obj
+        PROPERTIES CXX_VISIBILITY_PRESET hidden VISIBILITY_INLINES_HIDDEN YES
+    )
+endif()
+
+target_sources(
+    ActsPluginArrow
+    PRIVATE $<TARGET_OBJECTS:ActsExamplesIoArrow_obj>
+)
+
+acts_add_library(ExamplesIoArrow INTERFACE)
+target_include_directories(
+    ActsExamplesIoArrow
+    INTERFACE $<BUILD_INTERFACE:${CMAKE_CURRENT_SOURCE_DIR}/include>
+)
+target_link_libraries(
+    ActsExamplesIoArrow
+    INTERFACE Acts::ExamplesIoParquet Acts::PluginArrow Acts::ExamplesFramework
+)
+
+acts_compile_headers(ExamplesIoArrow GLOB include/**/*.hpp)
Lines changed: 59 additions & 0 deletions
@@ -0,0 +1,59 @@
+// This file is part of the ACTS project.
+//
+// Copyright (C) 2016 CERN for the benefit of the ACTS project
+//
+// This Source Code Form is subject to the terms of the Mozilla Public
+// License, v. 2.0. If a copy of the MPL was not distributed with this
+// file, You can obtain one at https://mozilla.org/MPL/2.0/.
+
+#pragma once
+
+#include "ActsExamples/EventData/SimParticle.hpp"
+#include "ActsExamples/Framework/DataHandle.hpp"
+#include "ActsExamples/Io/Parquet/ArrowOutputConverter.hpp"
+#include "ActsPlugins/Arrow/ArrowUtil.hpp"
+#include "ActsPlugins/Arrow/Export.hpp"
+
+#include <memory>
+#include <string>
+#include <vector>
+
+namespace ActsExamples {
+
+/// Convert a @c SimParticleContainer to an @c arrow::Table.
+///
+/// The output table has one row per particle with columns for id, PDG code,
+/// charge, mass, and the initial-state four-momentum / four-position. The
+/// table is placed on the whiteboard under the configured key; the
+/// @c ParquetWriter picks it up from there and stamps the @c event_id column.
+class ACTS_ARROW_EXPORT ArrowParticleOutputConverter final
+    : public ArrowOutputConverter {
+ public:
+  struct Config {
+    /// Input @c SimParticleContainer on the whiteboard.
+    std::string inputParticles;
+    /// Output whiteboard key for the resulting @c arrow::Table.
+    std::string outputTable = "particles";
+  };
+
+  explicit ArrowParticleOutputConverter(
+      const Config& cfg, std::unique_ptr<const Acts::Logger> logger = nullptr);
+
+  ~ArrowParticleOutputConverter() override;
+
+  const Config& config() const { return m_cfg; }
+
+  std::vector<std::string> collections() const override;
+
+ private:
+  ProcessCode execute(const AlgorithmContext& ctx) const override;
+
+  Config m_cfg;
+
+  ReadDataHandle<SimParticleContainer> m_inputParticles{this, "InputParticles"};
+
+  WriteDataHandle<ActsPlugins::ArrowUtil::ArrowTable> m_outputTable{
+      this, "OutputTable"};
+};
+
+}  // namespace ActsExamples
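The sim-hit and track converters in this commit document that a particle barcode is resolved to a row index in this particle table, with an "unmatched sentinel" when no lookup is possible. A minimal sketch of such a lookup follows, assuming barcodes can be represented as plain 64-bit integers and that row order equals container iteration order. `BarcodeSketch`, `buildRowIndex`, `resolveRow` and the sentinel value `-1` are hypothetical; the commit excerpt does not show the actual sentinel or helper names.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical sentinel for "no matching particle"; the value actually used
// by the converters is not shown in this commit excerpt.
constexpr std::int64_t kUnmatched = -1;

// Stand-in for the particle barcode type, reduced to a plain integer.
using BarcodeSketch = std::uint64_t;

// Build a barcode -> row-index lookup from the particles in table order.
std::unordered_map<BarcodeSketch, std::int64_t> buildRowIndex(
    const std::vector<BarcodeSketch>& particlesInTableOrder) {
  std::unordered_map<BarcodeSketch, std::int64_t> rowOf;
  rowOf.reserve(particlesInTableOrder.size());
  for (std::size_t row = 0; row < particlesInTableOrder.size(); ++row) {
    rowOf.emplace(particlesInTableOrder[row], static_cast<std::int64_t>(row));
  }
  return rowOf;
}

// Return the row index for a barcode, or the sentinel if it is unknown.
std::int64_t resolveRow(
    const std::unordered_map<BarcodeSketch, std::int64_t>& rowOf,
    BarcodeSketch barcode) {
  const auto it = rowOf.find(barcode);
  return it != rowOf.end() ? it->second : kUnmatched;
}
```

The lookup is only meaningful if it is built from the same container that produced the particle table, which is why the converter configs insist on that.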
Lines changed: 107 additions & 0 deletions
@@ -0,0 +1,107 @@
+// This file is part of the ACTS project.
+//
+// Copyright (C) 2016 CERN for the benefit of the ACTS project
+//
+// This Source Code Form is subject to the terms of the Mozilla Public
+// License, v. 2.0. If a copy of the MPL was not distributed with this
+// file, You can obtain one at https://mozilla.org/MPL/2.0/.
+
+#pragma once
+
+#include "Acts/Geometry/GeometryIdentifier.hpp"
+#include "ActsExamples/EventData/Cluster.hpp"
+#include "ActsExamples/EventData/SimHit.hpp"
+#include "ActsExamples/EventData/SimParticle.hpp"
+#include "ActsExamples/EventData/TruthMatching.hpp"
+#include "ActsExamples/Framework/DataHandle.hpp"
+#include "ActsExamples/Io/Parquet/ArrowOutputConverter.hpp"
+#include "ActsPlugins/Arrow/ArrowUtil.hpp"
+#include "ActsPlugins/Arrow/Export.hpp"
+
+#include <cstdint>
+#include <functional>
+#include <memory>
+#include <string>
+#include <unordered_map>
+#include <vector>
+
+namespace ActsExamples {
+
+/// Convert a @c SimHitContainer to an @c arrow::Table.
+///
+/// The output table has one row per event with list-valued columns. Hits are
+/// emitted in @c SimHitContainer iteration order, so the row index of a hit
+/// inside the per-event list equals its @c SimHitIndex; downstream tables
+/// (e.g. tracks) can therefore reference hits by that index.
+///
+/// When @c inputClusters and @c inputSimHitMeasurementsMap are both provided,
+/// the precomputed digitized cluster position (@c Cluster::globalPosition) of
+/// the matched measurement is written into @c x,y,z. Clusters have a one-to-one
+/// relation with measurements, so the @c SimHitMeasurementsMap (keyed by
+/// @c SimHitIndex, valued by measurement index) doubles as a sim-hit → cluster
+/// map. Otherwise those columns are filled with NaN. The truth position is
+/// always written into @c true_x,true_y,true_z.
+class ACTS_ARROW_EXPORT ArrowSimHitOutputConverter final
+    : public ArrowOutputConverter {
+ public:
+  struct Config {
+    /// Input @c SimHitContainer on the whiteboard.
+    std::string inputSimHits;
+    /// Optional input particle container used to resolve the hit's particle
+    /// barcode to a row index in the corresponding parquet table. Must be the
+    /// same container the @c ArrowParticleOutputConverter consumes for that
+    /// table; leaving it empty forces the unmatched sentinel.
+    std::string inputParticles;
+    /// Optional cluster container. Required (together with the map below) to
+    /// fill the digitized @c x,y,z columns from @c Cluster::globalPosition;
+    /// otherwise those are NaN. Clusters are indexed one-to-one with
+    /// measurements.
+    std::string inputClusters;
+    /// Optional sim-hit → measurement(s) inverse map; keyed by @c SimHitIndex.
+    /// Because clusters and measurements share indices, the values double as
+    /// cluster indices.
+    std::string inputSimHitMeasurementsMap;
+    /// Output whiteboard key for the resulting @c arrow::Table.
+    std::string outputTable = "simhits";
+    /// Resolves the @c detector subsystem id for a given hit's geometry id.
+    /// Defaults to reading the geometry id's @c extra byte; users can swap
+    /// in any custom mapping (e.g. by volume or by surface lookup) when the
+    /// geometry-construction side hasn't stamped @c extra yet.
+    std::function<std::uint8_t(Acts::GeometryIdentifier)> detectorResolver =
+        [](Acts::GeometryIdentifier gid) {
+          return static_cast<std::uint8_t>(gid.extra());
+        };
+  };
+
+  explicit ArrowSimHitOutputConverter(
+      const Config& cfg, std::unique_ptr<const Acts::Logger> logger = nullptr);
+
+  /// Build a resolver from a volume-id -> detector-id lookup table.
+  ///
+  /// This returns a pure C++ callable so Python can configure the mapping
+  /// once without paying a Python callback roundtrip for each hit.
+  static std::function<std::uint8_t(Acts::GeometryIdentifier)>
+  makeVolumeIdDetectorResolver(
+      const std::unordered_map<std::uint32_t, std::uint8_t>& volumeToDetector,
+      std::uint8_t defaultValue = 255);
+
+  const Config& config() const { return m_cfg; }
+
+  std::vector<std::string> collections() const override;
+
+ private:
+  ProcessCode execute(const AlgorithmContext& ctx) const override;
+
+  Config m_cfg;
+
+  ReadDataHandle<SimHitContainer> m_inputSimHits{this, "InputSimHits"};
+  ReadDataHandle<SimParticleContainer> m_inputParticles{this, "InputParticles"};
+  ReadDataHandle<ClusterContainer> m_inputClusters{this, "InputClusters"};
+  ReadDataHandle<SimHitMeasurementsMap> m_inputSimHitMeasurementsMap{
+      this, "InputSimHitMeasurementsMap"};
+
+  WriteDataHandle<ActsPlugins::ArrowUtil::ArrowTable> m_outputTable{
+      this, "OutputTable"};
+};
+
+}  // namespace ActsExamples
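The header above only declares `makeVolumeIdDetectorResolver`; its definition is not part of this excerpt. As a rough illustration of what such a factory might look like, here is a self-contained sketch that uses a stand-in geometry-id type instead of `Acts::GeometryIdentifier`. `GeoIdSketch` and `makeVolumeIdDetectorResolverSketch` are hypothetical names, and the body is an assumption about the behaviour implied by the doc comment (table lookup by volume id, falling back to `defaultValue`).

```cpp
#include <cstdint>
#include <functional>
#include <unordered_map>
#include <utility>

// Stand-in for Acts::GeometryIdentifier, reduced to the volume field only.
struct GeoIdSketch {
  std::uint32_t volumeId = 0;
  std::uint32_t volume() const { return volumeId; }
};

// Hypothetical counterpart of makeVolumeIdDetectorResolver: returns a pure
// C++ callable that maps a geometry id to a detector id via its volume id.
std::function<std::uint8_t(GeoIdSketch)> makeVolumeIdDetectorResolverSketch(
    std::unordered_map<std::uint32_t, std::uint8_t> volumeToDetector,
    std::uint8_t defaultValue = 255) {
  // The table is moved into the closure so the callable owns its data and
  // stays valid after the caller's map goes out of scope.
  return [table = std::move(volumeToDetector), defaultValue](GeoIdSketch gid) {
    const auto it = table.find(gid.volume());
    return it != table.end() ? it->second : defaultValue;
  };
}
```

Capturing the table by value is the design point the doc comment hints at: the mapping is configured once, and each per-hit call is a plain hash lookup with no callback into Python.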
Lines changed: 83 additions & 0 deletions
@@ -0,0 +1,83 @@
+// This file is part of the ACTS project.
+//
+// Copyright (C) 2016 CERN for the benefit of the ACTS project
+//
+// This Source Code Form is subject to the terms of the Mozilla Public
+// License, v. 2.0. If a copy of the MPL was not distributed with this
+// file, You can obtain one at https://mozilla.org/MPL/2.0/.
+
+#pragma once
+
+#include "ActsExamples/EventData/Index.hpp"
+#include "ActsExamples/EventData/SimParticle.hpp"
+#include "ActsExamples/EventData/Track.hpp"
+#include "ActsExamples/EventData/TruthMatching.hpp"
+#include "ActsExamples/Framework/DataHandle.hpp"
+#include "ActsExamples/Io/Parquet/ArrowOutputConverter.hpp"
+#include "ActsPlugins/Arrow/ArrowUtil.hpp"
+#include "ActsPlugins/Arrow/Export.hpp"
+
+#include <memory>
+#include <string>
+#include <vector>
+
+namespace ActsExamples {
+
+/// Convert a @c ConstTrackContainer to an @c arrow::Table.
+///
+/// The output table has one row per event with list-valued columns for the
+/// perigee parameters (d0, z0, phi, theta, qop), the majority truth particle
+/// id, the per-track measurement (hit) indices, and a running track index.
+/// The @c ParquetWriter stamps the @c event_id column.
+class ACTS_ARROW_EXPORT ArrowTrackOutputConverter final
+    : public ArrowOutputConverter {
+ public:
+  struct Config {
+    /// Input @c ConstTrackContainer on the whiteboard.
+    std::string inputTracks;
+    /// Optional input track-to-particle matching on the whiteboard. If empty,
+    /// @c majority_particle_id is filled with the unmatched sentinel.
+    std::string inputTrackParticleMatching;
+    /// Particle container used to resolve the matched truth particle's row
+    /// index in the corresponding parquet table. Must be the same container
+    /// the @c ArrowParticleOutputConverter consumes for that table; leaving
+    /// it empty disables index resolution and forces the unmatched sentinel.
+    std::string inputParticles;
+    /// Optional measurement → sim-hit map. When set, each track-state's
+    /// measurement index is translated to one or more sim-hit indices (the
+    /// row indices of the corresponding hits parquet table); without it,
+    /// @c hit_ids is left empty so consumers don't mistake measurement
+    /// indices for sim-hit indices.
+    std::string inputMeasurementSimHitsMap;
+    /// Output whiteboard key for the resulting @c arrow::Table.
+    std::string outputTable = "tracks";
+    /// If false, the @c t (perigee time) column is still in the schema but
+    /// every cell is written as null. Lets downstream readers consume a
+    /// stable schema regardless of whether the producer carried time info.
+    bool writeTime = true;
+  };
+
+  explicit ArrowTrackOutputConverter(
+      const Config& cfg, std::unique_ptr<const Acts::Logger> logger = nullptr);
+
+  const Config& config() const { return m_cfg; }
+
+  std::vector<std::string> collections() const override;
+
+ private:
+  ProcessCode execute(const AlgorithmContext& ctx) const override;
+
+  Config m_cfg;
+
+  ReadDataHandle<ConstTrackContainer> m_inputTracks{this, "InputTracks"};
+  ReadDataHandle<TrackParticleMatching> m_inputTrackParticleMatching{
+      this, "InputTrackParticleMatching"};
+  ReadDataHandle<SimParticleContainer> m_inputParticles{this, "InputParticles"};
+  ReadDataHandle<MeasurementSimHitsMap> m_inputMeasurementSimHitsMap{
+      this, "InputMeasurementSimHitsMap"};
+
+  WriteDataHandle<ActsPlugins::ArrowUtil::ArrowTable> m_outputTable{
+      this, "OutputTable"};
+};
+
+}  // namespace ActsExamples
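The `inputMeasurementSimHitsMap` comment describes translating each track state's measurement index into one or more sim-hit indices, and leaving `hit_ids` empty when no map is configured. The sketch below illustrates that rule under stated assumptions: `std::multimap` stands in for the real `MeasurementSimHitsMap` type (whose definition is not shown here), indices are plain 32-bit integers, and `resolveHitIds` is a hypothetical helper, not a function from this commit.

```cpp
#include <cstdint>
#include <map>
#include <vector>

// Stand-in for the measurement -> sim-hit multimap; an assumption made for
// this sketch, not the container type used by ACTS.
using MeasurementSimHitsMapSketch =
    std::multimap<std::uint32_t, std::uint32_t>;

// Translate per-track measurement indices into sim-hit indices. A null map
// yields an empty result, mirroring the documented "hit_ids is left empty"
// behaviour so measurement indices are never passed off as sim-hit indices.
std::vector<std::uint32_t> resolveHitIds(
    const std::vector<std::uint32_t>& measurementIndices,
    const MeasurementSimHitsMapSketch* measurementToSimHits) {
  std::vector<std::uint32_t> hitIds;
  if (measurementToSimHits == nullptr) {
    return hitIds;
  }
  for (std::uint32_t measurement : measurementIndices) {
    // One measurement may correspond to several sim hits.
    const auto [first, last] = measurementToSimHits->equal_range(measurement);
    for (auto it = first; it != last; ++it) {
      hitIds.push_back(it->second);
    }
  }
  return hitIds;
}
```

Returning an empty list instead of the raw measurement indices trades completeness for safety: a consumer joining `hit_ids` against the hits table can never silently pick up rows that belong to a different index space.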
