Skip to content

feat: ColliderML reader#5546

Draft
benjaminhuth wants to merge 101 commits into
acts-project:mainfrom
benjaminhuth:feature/collider-ml-reader-arrow
Draft

feat: ColliderML reader#5546
benjaminhuth wants to merge 101 commits into
acts-project:mainfrom
benjaminhuth:feature/collider-ml-reader-arrow

Conversation

@benjaminhuth

@benjaminhuth benjaminhuth commented Jun 4, 2026

Copy link
Copy Markdown
Member

benjaminhuth and others added 2 commits June 4, 2026 14:49
Both proto-track and KF writers now use truth_seeded_particles as the
denominator. This cleanly separates seeding layer coverage (~22% gap from
geoSelection config not covering all detector layers) from KF quality.
Proto-track efficiency should now be ~100% for seeded particles; KF shows
the actual per-seeded-particle loss.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- First PDF page is now a title slide with centred title and metadata.
- Efficiency and profile plots use step-function style (horizontal bar per
  bin + vertical error bars, no connecting lines) via xerr=half_bin_width
  and fmt="none". Matches standard HEP efficiency plot conventions.
- Title parameter added to make_plots() for reuse.
- Slides skill updated with both conventions.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@github-actions github-actions Bot added this to the next milestone Jun 4, 2026
@github-actions github-actions Bot added Component - Examples Affects the Examples module Component - Plugins Affects one or more Plugins Component - Documentation Affects the documentation labels Jun 4, 2026
@github-actions

github-actions Bot commented Jun 4, 2026

Copy link
Copy Markdown
Contributor

📊: Physics performance monitoring for 476a802

Full contents

physmon summary

benjaminhuth and others added 2 commits June 5, 2026 15:28
Integrates the official Arrow/Parquet base (PR acts-project#5410) from upstream/main.
Our branch retains only the ColliderML reader on top:
- ArrowUtil: keep flatColumnUInt*/readFlatParquetFile (used by ColliderML)
- Parquet CMakeLists: keep ColliderMLInputConverter target block
- Python Arrow bindings: keep ColliderMLInputConverter pybind11 declarations
- root_file_hashes.txt: keep ColliderML test hashes + upstream strip_space_points rename

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@github-actions github-actions Bot added Changes Performance and removed Component - Documentation Affects the documentation labels Jun 5, 2026
@github-actions github-actions Bot added the Infrastructure Changes to build tools, continous integration, ... label Jun 8, 2026
Comment thread Examples/Io/Parquet/src/ColliderMLInputConverter.cpp Outdated
Comment thread Examples/Io/Arrow/src/ColliderMLInputConverter.cpp
Comment thread Examples/Io/Arrow/src/ColliderMLInputConverter.cpp
Comment thread Examples/Io/Parquet/CMakeLists.txt Outdated
Comment thread Examples/Scripts/Python/colliderml_performance_plots.py Outdated
Comment thread Plugins/Arrow/include/ActsPlugins/Arrow/ArrowUtil.hpp Outdated
Comment thread Examples/Io/Parquet/include/ActsExamples/Io/Parquet/ColliderMLInputConverter.hpp Outdated
Comment thread Examples/Io/Parquet/include/ActsExamples/Io/Parquet/ColliderMLInputConverter.hpp Outdated
@benjaminhuth benjaminhuth marked this pull request as ready for review June 9, 2026 10:28
@benjaminhuth benjaminhuth requested a review from AJPfleger as a code owner June 9, 2026 10:28
@paulgessinger paulgessinger added the 🛑 blocked This item is blocked by another item label Jun 9, 2026

@paulgessinger paulgessinger left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

To be completely honest, with the amount of massaging needed here, it's getting close to the point where I would just conclude that the current ColliderML data content is just not suitable to be used as an input here.

We can still go ahead with this, but I would make it a priority to try to augment the parquet file content to take the extra work out of this implementation, like encoding the local dimensions and a way to map the geometry ids.

/cc @murnanedaniel

@paulgessinger paulgessinger Jun 9, 2026

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this the volume mapping? Should that be parquet? I think there's benefit in having this be ASCII, no? How large is it?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had it in CSV and thought I put it in parquet because its CSV with ten-thousands of lines of samples... but can do CSV as well

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, the CSV is roughly 500KB, vs 131KB, so CSV is acceptable.

Comment thread Examples/Io/Arrow/src/ColliderMLInputConverter.cpp
Comment on lines +77 to +92
std::optional<double> sigmaFromSmearer(
const ActsFatras::SingleParameterSmearFunction<RandomEngine>& fn) {
if (const auto* g = fn.target<const Digitization::Gauss>()) {
return g->sigma;
}
if (const auto* g = fn.target<const Digitization::GaussTrunc>()) {
return g->sigma;
}
if (const auto* g = fn.target<const Digitization::GaussClipped>()) {
return g->sigma;
}
if (const auto* g = fn.target<const Digitization::Exact>()) {
return g->sigma;
}
return std::nullopt;
}

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I we need this information here we should either provide it as an explicit input (json) or rethink the digitization to make this part of an interface (go away from an opaque function).

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah so I think we only have not-nice solutions here. But I think the canonical source of truth on subspace and sigmas are the digitization config files. I do not want to create a new file for this, and think this is the best we can do to fill in the missing info as of now.

I thought about a sigma interface, but not all smearers have a sigma canonically...

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@andiwand whats your take on this?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

only had a quick look but could we simply schedule the digitization after reading the hits? decoupling the digitization from the converter and breaking the input out. one could accidentally then schedule the geometric one which produces funny output but I would not worry too much for this workflow

Comment thread Examples/Io/Arrow/src/ColliderMLInputConverter.cpp Outdated
Comment thread Examples/Scripts/Python/generate_colliderml_geo_map.py
@@ -0,0 +1,257 @@
#!/usr/bin/env python3

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is probably fairly slow in python? Might be worth writing in C++ instead.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point, but its a do-once task...

Comment thread Plugins/Arrow/include/ActsPlugins/Arrow/ArrowUtil.hpp Outdated
Comment on lines +36 to +37
"""
data_dir_env = os.environ.get("COLLIDERML_DATA_DIR")

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why don't we just produce a file on the fly with the correct schema? I'm not a huge fan of downloading this behind-the-scenes.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm but this thing is meant to read in the ColliderML file as they are on the internet. So the second-best thing is to store a small sample of ColliderML on CERN ressources...

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or we just generate it: the Arrow schema pretty much guarantees we're testing the right thing, and avoids both these pitfalls.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

okay, convinced. the PU0 download is roughly 350MB because it downloads the whole shard, which is really a lot...

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But one caveat: we cannot check them if the GeoID resolving and the local-to-global mapping really works in colliderml. because the data are just different.

benjaminhuth and others added 2 commits June 9, 2026 17:24
- Add `collidermlParticleSchema()` to ArrowUtil with the exact columns
  ColliderML provides; fix `colliderml_truth_tracking.py` which was
  using the ACTS `particleSchema()` (a superset) as the expected schema
  for ColliderML particle files.
- Add upfront schema validation in `ColliderMLInputConverter::execute()`
  so all downstream column accesses are guaranteed correct.
- Move `readFlatParquetFile` out of the public ArrowUtil API into an
  anonymous-namespace helper in ColliderMLInputConverter.cpp (its only
  caller); add the required Arrow/Parquet includes there.
- Replace the `getCol.operator()<T>()` lambda pattern in
  `loadColliderMLGeoIdMap` with a free template function
  `getFlatColumn<T>()`.
- Add `--hits-dir` CLI argument to `generate_colliderml_geo_map.py`
  to avoid a hardcoded dataset subdirectory path.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
ParquetReader already enforces the schema via expectedSchemas/targetSchema
before the table reaches ColliderMLInputConverter, so the per-field checks
in execute() are redundant.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@benjaminhuth benjaminhuth marked this pull request as draft June 10, 2026 06:55
// Euclidean tolerance. Out-of-bounds means the geoIdMap assigned the wrong
// surface.
auto localResult =
surface->globalToLocal(ctx.geoContext, globalPos, Acts::Vector3{},

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

direction = Acts::Vector3{} will only work for "nice" surfaces. but might be fair to assume in this case as ODD sensors are all planar. would still be worth pointing this out as these kinds of lines get copied everywhere. in the future hopefully with a comment

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, we could assert that we have a regular surface at hand here.

Comment on lines +351 to +352
hitSeq.emplace_back(geoId, barcode, pos4, zero4, zero4,
static_cast<std::int32_t>(i));

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we don't have the direction of the hit, right? so we cannot re-digitize it

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nope

Comment on lines +403 to +404
DigitizedParameters dParams;
for (const auto& param : smearing.params) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this happens per hit right now?

Comment on lines +77 to +92
std::optional<double> sigmaFromSmearer(
const ActsFatras::SingleParameterSmearFunction<RandomEngine>& fn) {
if (const auto* g = fn.target<const Digitization::Gauss>()) {
return g->sigma;
}
if (const auto* g = fn.target<const Digitization::GaussTrunc>()) {
return g->sigma;
}
if (const auto* g = fn.target<const Digitization::GaussClipped>()) {
return g->sigma;
}
if (const auto* g = fn.target<const Digitization::Exact>()) {
return g->sigma;
}
return std::nullopt;
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

only had a quick look but could we simply schedule the digitization after reading the hits? decoupling the digitization from the converter and breaking the input out. one could accidentally then schedule the geometric one which produces funny output but I would not worry too much for this workflow

@sonarqubecloud

Copy link
Copy Markdown

Quality Gate Failed Quality Gate failed

Failed conditions
1 New Bugs (required ≤ 0)
D Reliability Rating on New Code (required ≥ A)

See analysis details on SonarQube Cloud

Catch issues before they fail your Quality Gate with our IDE extension SonarQube for IDE

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Changes Performance Component - Examples Affects the Examples module Component - Plugins Affects one or more Plugins Infrastructure Changes to build tools, continous integration, ... 🛑 blocked This item is blocked by another item

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants