feat: measurement and sim-hit tables for the Arrow/Parquet output #5586
Draft
murnanedaniel wants to merge 5 commits into
Emit tracker_hits as one row per measurement instead of one per sim-hit, with the contributing sim-hits' truth (particle_id, true_x/true_y/true_z, time) as nested list<list<>> columns (mirroring the calo contrib_* pattern). The row index is the measurement id, so there is no measurement_id column and tracks reference measurements by that index. Tracks: hit_ids are measurement indices, num_measurements is dropped (== len(hit_ids)), and hit_outlier is carried per measurement-state. Sim-hits contributing to no measurement are dropped; the dropped fraction is checked at runtime against maxUnmatchedSimHitFraction (default 0.1%).
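The regrouping and the runtime check described in this commit can be sketched in plain Python. This is an illustration only, not the converter's C++ code: `group_by_measurement`, its arguments, and the pair-list form of the sim-hit/measurement map are assumptions made for the example. Only the 0.1% default of `maxUnmatchedSimHitFraction` comes from the commit message.

```python
from collections import defaultdict

# Default quoted in the commit message (0.1%).
MAX_UNMATCHED_SIMHIT_FRACTION = 0.001


def group_by_measurement(n_measurements, n_simhits, links):
    """Group sim-hit indices per measurement and report the unmatched fraction.

    `links` is a hypothetical list of (measurement_index, simhit_index) pairs.
    """
    contributors = defaultdict(list)
    matched = set()
    for meas_idx, simhit_idx in links:
        contributors[meas_idx].append(simhit_idx)
        matched.add(simhit_idx)
    # One row per measurement; the row index doubles as the measurement id.
    rows = [contributors.get(i, []) for i in range(n_measurements)]
    # Sim-hits that contribute to no measurement count as unmatched.
    unmatched = (n_simhits - len(matched)) / n_simhits if n_simhits else 0.0
    return rows, unmatched


rows, unmatched = group_by_measurement(
    n_measurements=3,
    n_simhits=5,
    links=[(0, 0), (1, 1), (1, 2), (2, 3)],
)
exceeds_limit = unmatched > MAX_UNMATCHED_SIMHIT_FRACTION
```

In this toy input, sim-hit 4 contributes to no measurement, so one of five sim-hits is unmatched and the limit is exceeded.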
ArrowSimHitOutputConverter becomes the TRUTH-table converter: one entry per sim-hit in container order (the position is the sim-hit id the measurement table references). Standalone-complete for re-digitization: true position + time, 4-momentum at the hit (momentum4Before), deposited energy, particle link (row index into the particle table), hit index along the trajectory, and sensor identification. Drops the measurement inputs entirely; the sim/reco split follows the Release-2 schema discussion (truth in one file, reco in another, links between them).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…erter)

One entry per measurement (position = measurement id referenced by track hit_ids):
- local parameters and variances always expanded to the full bound layout, with a subspace bitmask (bit0=loc0, bit1=loc1, bit2=time) saying which components were measured
- measured time + variance
- reco global position
- sensor ids
- cluster-shape features from the geometric digitization (sizes, n_channels, sum_activation, local/global eta-phi and incidence angles)
- truth LINKS only: particle_ids (particle-table rows) and simhit_ids (tracker_simhits rows, same event)

The unmatched-sim-hit fraction is a warning-only diagnostic now that the truth table is complete.

Python bindings for the converter + measurementSchema(); test_arrow.py gains the two-table contract test (link resolution, particle consistency between tables, subspace semantics, envelope checks).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
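The subspace bitmask can be read back with a few bit tests. The sketch below only assumes the bit assignment stated above (bit0=loc0, bit1=loc1, bit2=time); the helper names `encode_subspace` and `decode_subspace` are hypothetical and not part of the converter or its bindings.

```python
# Bit assignment as stated in the commit message.
LOC0 = 1 << 0
LOC1 = 1 << 1
TIME = 1 << 2


def encode_subspace(has_loc0, has_loc1, has_time):
    """Build the bitmask from three booleans (hypothetical helper)."""
    mask = 0
    if has_loc0:
        mask |= LOC0
    if has_loc1:
        mask |= LOC1
    if has_time:
        mask |= TIME
    return mask


def decode_subspace(mask):
    """Return which components of the full bound layout were measured."""
    return {
        "loc0": bool(mask & LOC0),
        "loc1": bool(mask & LOC1),
        "time": bool(mask & TIME),
    }


strip_like = encode_subspace(True, False, False)   # only loc0 measured
pixel_with_time = encode_subspace(True, True, True)
decoded = decode_subspace(strip_like)
```

A reader of the table would consult the mask before using `loc1` or `time`, since the columns are always present in the expanded layout whether or not the component was measured.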
UnsafeAppend without Reserve writes past the buffer - heap corruption that surfaced as a Sequencer::run() segfault on the first real event.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…erter

Generalizes the converter beyond geometric-digitization MC workflows:
- inputClusters optional: without it (e.g. smearing-only digitization) the shape columns are emitted as zeros, keeping the schema stable.
- inputSimHits + inputSimHitMeasurementsMap optional as a pair: without them (data, or reco-only conversion) the particle_ids/simhit_ids link columns are emitted as empty lists and the unmatched diagnostic is skipped.

Adds a reco-only unit test (no truth, no clusters wired): reco columns populated, links empty, shape zeroed.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
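The fallback behaviour for the optional inputs can be illustrated as follows. This is a plain-Python sketch, not the converter: `optional_columns`, the dict-per-cluster layout, and the `(particle_ids, simhit_ids)` tuples are assumptions for the example, and only a subset of the shape columns is shown.

```python
# Subset of the shape columns named in the PR description.
SHAPE_COLUMNS = ("size_loc0", "size_loc1", "n_channels", "sum_activation")


def optional_columns(n_measurements, clusters=None, truth=None):
    """Build shape and link columns, falling back when inputs are absent.

    clusters: hypothetical list of dicts keyed by SHAPE_COLUMNS, or None.
    truth: hypothetical list of (particle_ids, simhit_ids) tuples, or None.
    """
    columns = {}
    for name in SHAPE_COLUMNS:
        if clusters is None:
            # No cluster input: zeros keep the schema stable.
            columns[name] = [0] * n_measurements
        else:
            columns[name] = [cluster[name] for cluster in clusters]
    if truth is None:
        # No truth input: links are empty lists, one per measurement.
        columns["particle_ids"] = [[] for _ in range(n_measurements)]
        columns["simhit_ids"] = [[] for _ in range(n_measurements)]
    else:
        columns["particle_ids"] = [list(p) for p, _ in truth]
        columns["simhit_ids"] = [list(s) for _, s in truth]
    return columns


reco_only = optional_columns(2)
full = optional_columns(
    2,
    clusters=[
        {"size_loc0": 2, "size_loc1": 1, "n_channels": 2, "sum_activation": 1.5},
        {"size_loc0": 1, "size_loc1": 1, "n_channels": 1, "sum_activation": 0.7},
    ],
    truth=[([7], [0]), ([7, 9], [1, 2])],
)
```

Both calls produce the same set of column names, which is the point of the fallback: downstream readers do not need to branch on the workflow.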



Follow-up to #5410: splits tracker-hit output into a reco table and a truth table, linked by row indices.

`tracker_hits` - one row per measurement (row index = the id in track `hit_ids`):
- `loc0`, `loc1`, `var_loc0`, `var_loc1` (full bound layout)
- `time`, `var_time`
- `subspace` bitmask (bit0/1/2 = loc0/loc1/time measured)
- `x`, `y`, `z` (reco global)
- `detector`, `volume_id`, `layer_id`, `surface_id`
- `size_loc0`, `size_loc1`, `n_channels`, `sum_activation`, `local_eta`, `local_phi`, `global_eta`, `global_phi`, `eta_angle`, `phi_angle`
- `particle_ids`, `simhit_ids` (nested truth links; merged clusters carry several contributors)

`tracker_simhits` - one row per sim-hit, all sim-hits, in container order (= the `simhit_ids` indices):
- `true_x`, `true_y`, `true_z`, `true_time`
- `tpx`, `tpy`, `tpz`, `tE`, `dE`
- `particle_id`, `hit_index`
- `detector`, `volume_id`, `layer_id`, `surface_id`

Only `inputMeasurements` + `trackingGeometry` are required: clusters and truth inputs are optional (shape columns zeroed / truth links empty), all collection names configurable, and the `detector` numbering can be supplied as a custom mapping. Python bindings and tests included.

The shape columns give cheap cluster topology without writing out full cell lists (~1M cells/event at ttbar PU200). They currently come out zeroed: clusters reach the converter with empty `channels` even in geometric digitization, though `localParameters` fills them and `ModuleClusters` should preserve them. Am I misreading the flow, or are the cells dropped somewhere? The columns are there either way, so they'll fill in once the cells do.

Validated on the ODD up to ttbar PU200: ~2% merged measurements, <=0.006% unmatched sim-hits, all links resolve.
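For readers who want to repeat the link check on their own output, here is a minimal sketch using plain dicts of lists in place of the two tables. The function `check_links` is hypothetical. Two things in it are assumptions rather than statements from this PR: a measurement is counted as merged when it has more than one sim-hit link, and each linked sim-hit's `particle_id` is expected to appear in the measurement's `particle_ids`.

```python
def check_links(tracker_hits, tracker_simhits):
    """Resolve simhit_ids against tracker_simhits and collect summary fractions."""
    n_simhits = len(tracker_simhits["particle_id"])
    n_hits = len(tracker_hits["simhit_ids"])
    referenced = set()
    merged = 0
    for simhit_ids, particle_ids in zip(
        tracker_hits["simhit_ids"], tracker_hits["particle_ids"]
    ):
        for sid in simhit_ids:
            # A link resolves if it is a valid row index of the truth table.
            if not 0 <= sid < n_simhits:
                raise IndexError(f"simhit id {sid} is out of range")
            # Assumed consistency rule between the two tables.
            if tracker_simhits["particle_id"][sid] not in particle_ids:
                raise ValueError(f"particle mismatch for simhit {sid}")
            referenced.add(sid)
        if len(simhit_ids) > 1:
            merged += 1
    return {
        "merged_fraction": merged / n_hits if n_hits else 0.0,
        "unmatched_fraction": (
            (n_simhits - len(referenced)) / n_simhits if n_simhits else 0.0
        ),
    }


hits = {"simhit_ids": [[0], [1, 2], []], "particle_ids": [[7], [7, 9], []]}
simhits = {"particle_id": [7, 7, 9, 9]}
stats = check_links(hits, simhits)
```

With real output, the two inputs would be the columns read from the `tracker_hits` and `tracker_simhits` tables of the same event.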
cc @paulgessinger @andiwand @benjaminhuth