feat: ColliderML reader #5546
Both proto-track and KF writers now use truth_seeded_particles as the denominator. This cleanly separates seeding layer coverage (~22% gap from geoSelection config not covering all detector layers) from KF quality. Proto-track efficiency should now be ~100% for seeded particles; KF shows the actual per-seeded-particle loss.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
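To illustrate the denominator change described in this commit, here is a minimal sketch with made-up particle barcodes. The function name `efficiency` and all sets below are hypothetical and not taken from the PR; only the idea of dividing by the truth-seeded particles comes from the commit message.

```python
def efficiency(matched_ids, denominator_ids):
    """Fraction of denominator particles that were matched.

    Returns None when the denominator is empty, so that an empty
    selection is not silently reported as 0% or 100%.
    """
    denom = set(denominator_ids)
    if not denom:
        return None
    return len(set(matched_ids) & denom) / len(denom)


# Hypothetical particle barcodes for a single event (illustration only).
all_truth = {1, 2, 3, 4, 5, 6, 7, 8, 9}
truth_seeded = {1, 2, 3, 4, 5, 6, 7}   # particles covered by the seeding layers
proto_tracks = {1, 2, 3, 4, 5, 6, 7}   # particles with a proto-track
kf_tracks = {1, 2, 3, 4, 5, 6}         # particles with a fitted track

# Seeding coverage is measured against all truth particles ...
seeding_coverage = efficiency(truth_seeded, all_truth)
# ... while proto-track and KF efficiencies use the seeded particles only.
proto_efficiency = efficiency(proto_tracks, truth_seeded)
kf_efficiency = efficiency(kf_tracks, truth_seeded)
```

With this split, a drop in `seeding_coverage` points at layer coverage, while a drop in `kf_efficiency` points at the fit itself.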
- First PDF page is now a title slide with centred title and metadata.
- Efficiency and profile plots use step-function style (horizontal bar per bin + vertical error bars, no connecting lines) via xerr=half_bin_width and fmt="none". Matches standard HEP efficiency plot conventions.
- Title parameter added to make_plots() for reuse.
- Slides skill updated with both conventions.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
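The step-function style mentioned in this commit can be sketched as follows. The helper names `binned_efficiency` and `plot_efficiency` are hypothetical, and the simple binomial (normal-approximation) error is an assumption; the PR does not say which error estimate `make_plots()` uses. The `errorbar` call with `xerr` and `fmt="none"` is the standard matplotlib `Axes.errorbar` API.

```python
import math


def binned_efficiency(edges, n_pass, n_total):
    """Return bin centres, half bin widths, efficiencies and errors.

    Bins with no entries are skipped. The error is the simple binomial
    normal approximation sqrt(e * (1 - e) / n) (an assumption).
    """
    centres, half_widths, eff, err = [], [], [], []
    for lo, hi, k, n in zip(edges[:-1], edges[1:], n_pass, n_total):
        if n == 0:
            continue
        e = k / n
        centres.append(0.5 * (lo + hi))
        half_widths.append(0.5 * (hi - lo))
        eff.append(e)
        err.append(math.sqrt(e * (1.0 - e) / n))
    return centres, half_widths, eff, err


def plot_efficiency(ax, edges, n_pass, n_total, label=None):
    """Draw one horizontal bar per bin plus vertical error bars.

    `ax` is expected to be a matplotlib Axes. fmt="none" suppresses
    markers and connecting lines, so only the error bars are drawn.
    """
    x, xerr, y, yerr = binned_efficiency(edges, n_pass, n_total)
    ax.errorbar(x, y, xerr=xerr, yerr=yerr, fmt="none", label=label)


# Usage, assuming matplotlib is installed:
#   import matplotlib.pyplot as plt
#   fig, ax = plt.subplots()
#   plot_efficiency(ax, [0.0, 1.0, 3.0], [5, 8], [10, 10], label="KF")
```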
Integrates the official Arrow/Parquet base (PR acts-project#5410) from upstream/main. Our branch retains only the ColliderML reader on top:
- ArrowUtil: keep flatColumnUInt*/readFlatParquetFile (used by ColliderML)
- Parquet CMakeLists: keep ColliderMLInputConverter target block
- Python Arrow bindings: keep ColliderMLInputConverter pybind11 declarations
- root_file_hashes.txt: keep ColliderML test hashes + upstream strip_space_points rename

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
paulgessinger left a comment
Thanks!
To be completely honest, with the amount of massaging needed here, it's getting close to the point where I would just conclude that the current ColliderML data content is just not suitable to be used as an input here.
We can still go ahead with this, but I would make it a priority to try to augment the parquet file content to take the extra work out of this implementation, like encoding the local dimensions and a way to map the geometry ids.
/cc @murnanedaniel
Is this the volume mapping? Should that be parquet? I think there's benefit in having this be ASCII, no? How large is it?
I had it in CSV and then put it in Parquet because the CSV has tens of thousands of lines of samples... but I can do CSV as well.
Okay, the CSV is roughly 500 KB vs. 131 KB for Parquet, so CSV is acceptable.
```cpp
std::optional<double> sigmaFromSmearer(
    const ActsFatras::SingleParameterSmearFunction<RandomEngine>& fn) {
  if (const auto* g = fn.target<const Digitization::Gauss>()) {
    return g->sigma;
  }
  if (const auto* g = fn.target<const Digitization::GaussTrunc>()) {
    return g->sigma;
  }
  if (const auto* g = fn.target<const Digitization::GaussClipped>()) {
    return g->sigma;
  }
  if (const auto* g = fn.target<const Digitization::Exact>()) {
    return g->sigma;
  }
  return std::nullopt;
}
```
If we need this information here, we should either provide it as an explicit input (JSON) or rethink the digitization to make this part of an interface (moving away from an opaque function).
Yeah, so I think we only have not-nice solutions here. But I think the canonical source of truth on subspace and sigmas is the digitization config files. I do not want to create a new file for this, and I think this is the best we can do to fill in the missing info as of now.
I thought about a sigma interface, but not all smearers have a sigma canonically...
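For reference, the "sigma interface" idea could look roughly like the sketch below, written in Python for brevity rather than in the C++ of the converter. Everything here is hypothetical (`Smearer`, `GaussSmearer`, `UniformSmearer` and their fields are not ACTS types); it only shows how an optional return value could handle smearers that have no canonical sigma.

```python
import random
from dataclasses import dataclass
from typing import Optional, Protocol


class Smearer(Protocol):
    """Hypothetical interface: smear a value and optionally report a sigma."""

    def __call__(self, value: float, rng: random.Random) -> float: ...

    def sigma(self) -> Optional[float]: ...


@dataclass(frozen=True)
class GaussSmearer:
    width: float

    def __call__(self, value: float, rng: random.Random) -> float:
        return value + rng.gauss(0.0, self.width)

    def sigma(self) -> Optional[float]:
        # A Gaussian smearer has an obvious resolution to report.
        return self.width


@dataclass(frozen=True)
class UniformSmearer:
    half_range: float

    def __call__(self, value: float, rng: random.Random) -> float:
        return value + rng.uniform(-self.half_range, self.half_range)

    def sigma(self) -> Optional[float]:
        # No canonical sigma here, so the caller has to decide what to do.
        return None


def sigma_from_smearer(smearer: Smearer) -> Optional[float]:
    """Replaces probing the concrete type behind an opaque callable."""
    return smearer.sigma()
```

The caller still has to handle the `None` case, so this moves the problem into the interface rather than removing it.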
Only had a quick look, but could we simply schedule the digitization after reading the hits? That would decouple the digitization from the converter and break the input out. One could then accidentally schedule the geometric one, which produces funny output, but I would not worry too much about that for this workflow.
```
@@ -0,0 +1,257 @@
#!/usr/bin/env python3
```
This is probably fairly slow in Python? Might be worth writing in C++ instead.
Good point, but it's a do-once task...
```python
"""
data_dir_env = os.environ.get("COLLIDERML_DATA_DIR")
```
Why don't we just produce a file on the fly with the correct schema? I'm not a huge fan of downloading this behind-the-scenes.
Hmm, but this thing is meant to read in the ColliderML files as they are on the internet. So the second-best thing is to store a small sample of ColliderML on CERN resources...
Or we just generate it: the Arrow schema pretty much guarantees we're testing the right thing, and avoids both these pitfalls.
Okay, convinced. The PU0 download is roughly 350 MB because it downloads the whole shard, which is really a lot...
But one caveat: we then cannot check whether the GeoID resolution and the local-to-global mapping really work on ColliderML data, because the generated data are just different.
- Add `collidermlParticleSchema()` to ArrowUtil with the exact columns ColliderML provides; fix `colliderml_truth_tracking.py` which was using the ACTS `particleSchema()` (a superset) as the expected schema for ColliderML particle files.
- Add upfront schema validation in `ColliderMLInputConverter::execute()` so all downstream column accesses are guaranteed correct.
- Move `readFlatParquetFile` out of the public ArrowUtil API into an anonymous-namespace helper in ColliderMLInputConverter.cpp (its only caller); add the required Arrow/Parquet includes there.
- Replace the `getCol.operator()<T>()` lambda pattern in `loadColliderMLGeoIdMap` with a free template function `getFlatColumn<T>()`.
- Add `--hits-dir` CLI argument to `generate_colliderml_geo_map.py` to avoid a hardcoded dataset subdirectory path.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
ParquetReader already enforces the schema via expectedSchemas/targetSchema before the table reaches ColliderMLInputConverter, so the per-field checks in execute() are redundant.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
```cpp
// Euclidean tolerance. Out-of-bounds means the geoIdMap assigned the wrong
// surface.
auto localResult =
    surface->globalToLocal(ctx.geoContext, globalPos, Acts::Vector3{},
```
`direction = Acts::Vector3{}` will only work for "nice" surfaces, but that might be fair to assume in this case, as ODD sensors are all planar. It would still be worth pointing this out with a comment, as these kinds of lines get copied everywhere (in the future, hopefully along with the comment).
yeah, we could assert that we have a regular surface at hand here.
```cpp
hitSeq.emplace_back(geoId, barcode, pos4, zero4, zero4,
                    static_cast<std::int32_t>(i));
```
we don't have the direction of the hit, right? so we cannot re-digitize it
```cpp
DigitizedParameters dParams;
for (const auto& param : smearing.params) {
```
this happens per hit right now?