feat: ColliderML reader by benjaminhuth · Pull Request #5546 · acts-project/acts

benjaminhuth · 2026-06-04T13:31:48Z

Blocked by

Both proto-track and KF writers now use truth_seeded_particles as the denominator. This cleanly separates seeding layer coverage (~22% gap from geoSelection config not covering all detector layers) from KF quality. Proto-track efficiency should now be ~100% for seeded particles; KF shows the actual per-seeded-particle loss.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
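The denominator change described in this commit message can be illustrated with a minimal sketch. This is not the writers' actual code: the `efficiency` helper and all particle IDs are hypothetical, and only the idea of sharing the truth-seeded particles as the denominator is taken from the message.

```python
def efficiency(matched_ids, denominator_ids):
    """Fraction of denominator particles that also appear in matched_ids."""
    denominator = set(denominator_ids)
    if not denominator:
        return float("nan")
    return len(denominator & set(matched_ids)) / len(denominator)


# Hypothetical toy event: the particle IDs are made up for illustration.
all_particles = {1, 2, 3, 4, 5, 6, 7, 8, 9}
truth_seeded_particles = {1, 2, 3, 4, 5, 6, 7}  # particles that received a truth seed
proto_tracks = {1, 2, 3, 4, 5, 6, 7}            # every seeded particle has a proto-track
kf_tracks = {1, 2, 3, 4, 5, 6}                  # one seeded particle is lost in the fit

# Seeding coverage is measured against all particles ...
seeding_coverage = efficiency(truth_seeded_particles, all_particles)
# ... while proto-track and KF efficiencies share the seeded denominator.
proto_eff = efficiency(proto_tracks, truth_seeded_particles)
kf_eff = efficiency(kf_tracks, truth_seeded_particles)
```

With a shared denominator, a particle that was never seeded lowers only `seeding_coverage`, so any drop in `kf_eff` can be attributed to the fit itself.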

- First PDF page is now a title slide with centred title and metadata.
- Efficiency and profile plots use step-function style (horizontal bar per bin + vertical error bars, no connecting lines) via xerr=half_bin_width and fmt="none". Matches standard HEP efficiency plot conventions.
- Title parameter added to make_plots() for reuse.
- Slides skill updated with both conventions.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
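The step-function style mentioned above could look roughly like the following sketch. It is not the code from `make_plots()`: the helper names are hypothetical, the simple binomial error is an assumption, and `ax` is expected to be a matplotlib `Axes` object supplied by the caller.

```python
import math


def binned_efficiency(edges, passed, total):
    """Return bin centres, half bin widths, efficiencies and binomial errors."""
    centres, half_widths, effs, errs = [], [], [], []
    for lo, hi, k, n in zip(edges[:-1], edges[1:], passed, total):
        if n == 0:
            continue  # skip empty bins rather than divide by zero
        eff = k / n
        centres.append(0.5 * (lo + hi))
        half_widths.append(0.5 * (hi - lo))
        effs.append(eff)
        # Assumption: plain binomial error, not necessarily what the PR uses.
        errs.append(math.sqrt(eff * (1.0 - eff) / n))
    return centres, half_widths, effs, errs


def draw_efficiency(ax, edges, passed, total):
    """Draw one horizontal bar per bin plus a vertical error bar, no line."""
    centres, half_bin_width, effs, errs = binned_efficiency(edges, passed, total)
    # fmt="none" suppresses markers and connecting lines; xerr spans the bin.
    ax.errorbar(centres, effs, xerr=half_bin_width, yerr=errs, fmt="none")
    ax.set_ylim(0.0, 1.05)
    return ax
```

A caller would typically create the axes with `fig, ax = matplotlib.pyplot.subplots()` and then pass `ax` together with the bin edges and per-bin counts.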

github-actions · 2026-06-04T14:45:51Z

📊: Physics performance monitoring for `476a802`

Integrates the official Arrow/Parquet base (PR acts-project#5410) from upstream/main. Our branch retains only the ColliderML reader on top:

- ArrowUtil: keep flatColumnUInt*/readFlatParquetFile (used by ColliderML)
- Parquet CMakeLists: keep ColliderMLInputConverter target block
- Python Arrow bindings: keep ColliderMLInputConverter pybind11 declarations
- root_file_hashes.txt: keep ColliderML test hashes + upstream strip_space_points rename

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

paulgessinger

Thanks!

To be completely honest, with the amount of massaging needed here, it's getting close to the point where I would just conclude that the current ColliderML data content is just not suitable to be used as an input here.

We can still go ahead with this, but I would make it a priority to try to augment the parquet file content to take the extra work out of this implementation, like encoding the local dimensions and a way to map the geometry ids.

/cc @murnanedaniel

paulgessinger · 2026-06-09T11:25:07Z

Is this the volume mapping? Should that be parquet? I think there's benefit in having this be ASCII, no? How large is it?

I had it in CSV and thought I'd put it in Parquet because it's a CSV with tens of thousands of lines of samples... but I can do CSV as well.

Okay, the CSV is roughly 500 KB vs. 131 KB for the Parquet file, so CSV is acceptable.

paulgessinger · 2026-06-09T11:40:31Z

+std::optional<double> sigmaFromSmearer(
+    const ActsFatras::SingleParameterSmearFunction<RandomEngine>& fn) {
+  if (const auto* g = fn.target<const Digitization::Gauss>()) {
+    return g->sigma;
+  }
+  if (const auto* g = fn.target<const Digitization::GaussTrunc>()) {
+    return g->sigma;
+  }
+  if (const auto* g = fn.target<const Digitization::GaussClipped>()) {
+    return g->sigma;
+  }
+  if (const auto* g = fn.target<const Digitization::Exact>()) {
+    return g->sigma;
+  }
+  return std::nullopt;
+}


If we need this information here, we should either provide it as an explicit input (JSON) or rethink the digitization to make this part of an interface (moving away from an opaque function).

Yeah, so I think we only have not-nice solutions here. But I think the canonical source of truth on subspace and sigmas is the set of digitization config files. I do not want to create a new file for this, and I think this is the best we can do to fill in the missing info for now.

I thought about a sigma interface, but not all smearers have a sigma canonically...
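To make the trade-off concrete, here is a minimal sketch of what an explicit sigma interface could look like, written in Python for brevity. It is purely illustrative and not ACTS API: `Smearer`, `GaussSmearer`, `OpaqueSmearer` and `measurement_sigmas` are hypothetical names. The optional return value reflects the concern that not every smearer has a canonical sigma.

```python
import random
from dataclasses import dataclass
from typing import Callable, Optional, Protocol


class Smearer(Protocol):
    """Hypothetical interface: smear a value and optionally report a sigma."""

    def __call__(self, value: float, rng: random.Random) -> float: ...

    def sigma(self) -> Optional[float]: ...


@dataclass
class GaussSmearer:
    width: float

    def __call__(self, value: float, rng: random.Random) -> float:
        return value + rng.gauss(0.0, self.width)

    def sigma(self) -> Optional[float]:
        return self.width


@dataclass
class OpaqueSmearer:
    """Wraps an arbitrary callable for which no sigma is declared."""

    fn: Callable[[float, random.Random], float]

    def __call__(self, value: float, rng: random.Random) -> float:
        return self.fn(value, rng)

    def sigma(self) -> Optional[float]:
        return None


def measurement_sigmas(smearers):
    """Collect sigmas, failing loudly when a smearer does not provide one."""
    sigmas = []
    for index, smearer in enumerate(smearers):
        s = smearer.sigma()
        if s is None:
            raise ValueError(f"smearer {index} does not declare a sigma")
        sigmas.append(s)
    return sigmas
```

Compared with probing the concrete type behind an opaque function, the sigma becomes part of the contract, at the cost of every smearer having to state explicitly whether it has one.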

@andiwand what's your take on this?

only had a quick look but could we simply schedule the digitization after reading the hits? decoupling the digitization from the converter and breaking the input out. one could accidentally then schedule the geometric one which produces funny output but I would not worry too much for this workflow

paulgessinger · 2026-06-09T11:50:35Z

@@ -0,0 +1,257 @@
+#!/usr/bin/env python3


This is probably fairly slow in python? Might be worth writing in C++ instead.

good point, but it's a one-off task...

paulgessinger · 2026-06-09T11:55:23Z

+    """
+    data_dir_env = os.environ.get("COLLIDERML_DATA_DIR")


Why don't we just produce a file on the fly with the correct schema? I'm not a huge fan of downloading this behind-the-scenes.

Hmm, but this thing is meant to read in the ColliderML files as they are on the internet. So the second-best thing is to store a small sample of ColliderML on CERN resources...

Or we just generate it: the Arrow schema pretty much guarantees we're testing the right thing, and avoids both these pitfalls.

okay, convinced. the PU0 download is roughly 350MB because it downloads the whole shard, which is really a lot...

But one caveat: with generated data we cannot check whether the GeoID resolving and the local-to-global mapping really work for ColliderML, because the data are simply different.

- Add `collidermlParticleSchema()` to ArrowUtil with the exact columns ColliderML provides; fix `colliderml_truth_tracking.py` which was using the ACTS `particleSchema()` (a superset) as the expected schema for ColliderML particle files.
- Add upfront schema validation in `ColliderMLInputConverter::execute()` so all downstream column accesses are guaranteed correct.
- Move `readFlatParquetFile` out of the public ArrowUtil API into an anonymous-namespace helper in ColliderMLInputConverter.cpp (its only caller); add the required Arrow/Parquet includes there.
- Replace the `getCol.operator()<T>()` lambda pattern in `loadColliderMLGeoIdMap` with a free template function `getFlatColumn<T>()`.
- Add `--hits-dir` CLI argument to `generate_colliderml_geo_map.py` to avoid a hardcoded dataset subdirectory path.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

ParquetReader already enforces the schema via expectedSchemas/targetSchema before the table reaches ColliderMLInputConverter, so the per-field checks in execute() are redundant.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

andiwand · 2026-06-12T07:05:18Z

+    // Euclidean tolerance. Out-of-bounds means the geoIdMap assigned the wrong
+    // surface.
+    auto localResult =
+        surface->globalToLocal(ctx.geoContext, globalPos, Acts::Vector3{},


direction = Acts::Vector3{} will only work for "nice" surfaces. but might be fair to assume in this case as ODD sensors are all planar. would still be worth pointing this out as these kinds of lines get copied everywhere. in the future hopefully with a comment

yeah, we could assert that we have a regular surface at hand here.
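The reason a zero direction is harmless for planar sensors can be seen from a small geometric sketch. This is plain Python, not the ACTS `globalToLocal` implementation, and all names are hypothetical: on a plane the local coordinates follow from the position alone, so no direction enters the calculation. Surfaces whose local frame depends on the track direction would not fit this pattern.

```python
def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def global_to_local_plane(center, u_axis, v_axis, normal, point, tolerance=1e-6):
    """Project a global point onto a plane's local (u, v) axes.

    Returns None when the point lies further than `tolerance` from the
    plane, which here would hint at a wrong surface assignment.
    """
    offset = _sub(point, center)
    if abs(_dot(offset, normal)) > tolerance:
        return None
    return (_dot(offset, u_axis), _dot(offset, v_axis))


# Hypothetical sensor plane at z = 5 with axes aligned to global x and y.
center = (0.0, 0.0, 5.0)
u_axis, v_axis, normal = (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)

on_plane = global_to_local_plane(center, u_axis, v_axis, normal, (1.0, 2.0, 5.0))
off_plane = global_to_local_plane(center, u_axis, v_axis, normal, (1.0, 2.0, 5.1))
```

An assertion that the surface is planar, as suggested in the reply above, would document this assumption at the call site.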

andiwand · 2026-06-12T07:07:19Z

+        hitSeq.emplace_back(geoId, barcode, pos4, zero4, zero4,
+                            static_cast<std::int32_t>(i));


we don't have the direction of the hit, right? so we cannot re-digitize it

andiwand · 2026-06-12T07:10:06Z

+    DigitizedParameters dParams;
+    for (const auto& param : smearing.params) {


this happens per hit right now?

andiwand · 2026-06-12T07:12:31Z

+std::optional<double> sigmaFromSmearer(
+    const ActsFatras::SingleParameterSmearFunction<RandomEngine>& fn) {
+  if (const auto* g = fn.target<const Digitization::Gauss>()) {
+    return g->sigma;
+  }
+  if (const auto* g = fn.target<const Digitization::GaussTrunc>()) {
+    return g->sigma;
+  }
+  if (const auto* g = fn.target<const Digitization::GaussClipped>()) {
+    return g->sigma;
+  }
+  if (const auto* g = fn.target<const Digitization::Exact>()) {
+    return g->sigma;
+  }
+  return std::nullopt;
+}


only had a quick look but could we simply schedule the digitization after reading the hits? decoupling the digitization from the converter and breaking the input out. one could accidentally then schedule the geometric one which produces funny output but I would not worry too much for this workflow

sonarqubecloud · 2026-06-12T11:45:09Z

Quality Gate failed

Failed conditions
1 New Bugs (required ≤ 0)
D Reliability Rating on New Code (required ≥ A)

paulgessinger added 30 commits May 12, 2026 16:59

feat: add parent id to existing SimParticle EDM

6ede096

feat: Make ScopedTimer threadsafe

eb9835a

particle docs fixes

37b3ec6

clang-format

06005ef

MERGE

59290dd

feat: Initial arrow/parquet support

32d6228

experiment with arrow object library

970f19d

clean up symbol visibility in wrapper target

d2080b8

make the isolated arrow absorption optional

f331c3b

add parquet option to full chain odd

3ce451b

updated particle arrow schema based on colliderml

99478a4

particle arrow converter writes parent id

c28a93c

use row indices as particle ids

5fbc174

add edm4hep to parquet conversion script

5801bf5

update output converters to produce proper nulls

299d69e

add sim hit output converter + connect to track hit_ids

74cd1ea

update detector resolver

c6b9587

add jobs arg to full chain odd

a163863

drop separate generated particles output

a824336

add plan for edm4hep input perf opt

bcd8891

clang-format

4bbc029

initial calo conversion

2de4baf

validated calo output

c421bc2

optimization for calo hits and averaging timers

8c72b08

some timing for edm4hepsiminput

fddfd47

add proper detector encoding, speedup

fed4480

restore pythia script (?)

ffbd6a8

use acts units more

fac62a5

dataset system shards files

b15e10c

address large number of propagation to perigee failures

5865ab1

benjaminhuth and others added 2 commits June 4, 2026 14:49

github-actions Bot added this to the next milestone Jun 4, 2026

github-actions Bot added Component - Examples Affects the Examples module Component - Plugins Affects one or more Plugins Component - Documentation Affects the documentation labels Jun 4, 2026

benjaminhuth and others added 2 commits June 5, 2026 15:28

update

1b7c564

github-actions Bot added Changes Performance and removed Component - Documentation Affects the documentation labels Jun 5, 2026

update unused files

134fa89

github-actions Bot added the Infrastructure Changes to build tools, continous integration, ... label Jun 8, 2026

benjaminhuth added 3 commits June 9, 2026 09:49

lint

a9391d2

Merge remote-tracking branch 'upstream/main' into feature/collider-ml-reader-arrow

5fce52a

remove unrelated stuff

494f445

benjaminhuth commented Jun 9, 2026

benjaminhuth added 2 commits June 9, 2026 12:24

update

b1cf5ed

restore odd.py

a3e6abd

benjaminhuth marked this pull request as ready for review June 9, 2026 10:28

benjaminhuth requested a review from AJPfleger as a code owner June 9, 2026 10:28

benjaminhuth requested a review from paulgessinger June 9, 2026 10:29

paulgessinger added the 🛑 blocked This item is blocked by another item label Jun 9, 2026

paulgessinger reviewed Jun 9, 2026

benjaminhuth and others added 2 commits June 9, 2026 17:24

remove redundant schema validation from execute()

2636986

benjaminhuth marked this pull request as draft June 10, 2026 06:55

andiwand reviewed Jun 12, 2026

update

476a802
