Performance: replace df.iterrows() in TrackedTipPipeline trajectory build #194

@eberrigan

Context

Surfaced by /review-pr on PR #190 (Performance reviewer). tracked_tip_pipeline.py (around line 247) builds result["trajectories"] with a for _, row in df.iterrows(): loop that calls result["trajectories"].append({...}) once per row. df.iterrows() is a documented pandas anti-pattern: it constructs a new Series for every row and upcasts mixed column dtypes to a common type, so values get boxed and the original dtypes are not preserved.

The loop is sub-second at current scale (4 plates × 6 tracks × 311 frames = 7,464 rows). At 100 plates × 5,000 frames × 20 tracks (10M rows) it is expected to become the dominant compute cost.

Proposal

Replace the loop with either df.to_dict(orient="records") plus explicit dtype coercion, or a vectorized zip over NumPy arrays. Both are expected to be roughly 5-10× faster than iterrows (to be confirmed by the benchmark). Pick whichever benchmarks better.
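A minimal sketch of the two candidates next to the current pattern, to make the comparison concrete. The column names (track_id, frame, x, y) and the function names are assumptions for illustration only; the real schema and dict shape in tracked_tip_pipeline.py may differ.

```python
import numpy as np
import pandas as pd

# Hypothetical columns; not taken from tracked_tip_pipeline.py.
COLUMNS = ["track_id", "frame", "x", "y"]

df = pd.DataFrame(
    {
        "track_id": np.array([0, 0, 1], dtype=np.int64),
        "frame": np.array([0, 1, 0], dtype=np.int64),
        "x": np.array([1.5, 2.5, 3.5]),
        "y": np.array([4.0, 5.0, 6.0]),
    }
)


def build_iterrows(df: pd.DataFrame) -> list:
    """Current pattern: one Series per row, then manual coercion."""
    out = []
    for _, row in df.iterrows():
        out.append(
            {
                "track_id": int(row["track_id"]),
                "frame": int(row["frame"]),
                "x": float(row["x"]),
                "y": float(row["y"]),
            }
        )
    return out


def build_records(df: pd.DataFrame) -> list:
    """Option A: coerce column dtypes once, then convert all rows in one call."""
    coerced = df[COLUMNS].astype(
        {"track_id": "int64", "frame": "int64", "x": "float64", "y": "float64"}
    )
    return coerced.to_dict(orient="records")


def build_zip(df: pd.DataFrame) -> list:
    """Option B: pull each column out as a plain Python list, then zip."""
    # ndarray.tolist() converts NumPy scalars to native int/float.
    columns = [df[name].to_numpy().tolist() for name in COLUMNS]
    return [dict(zip(COLUMNS, values)) for values in zip(*columns)]


# All three builders should produce the same list of dicts.
assert build_iterrows(df) == build_records(df) == build_zip(df)
```

Option B guarantees native Python scalars through tolist(). With option A, whether values come back as native or NumPy scalars can depend on the pandas version, so it is worth checking before anything downstream serializes the result.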

Acceptance

  • Benchmark before/after on the 4-plate fixture (current: ~5.1 s for the batch).
  • Add a perf test exercising larger-scale batch (e.g. synthetic 10-series × 100-frame × 5-track = 5,000 rows).
  • All existing tests still pass — the dict shape is unchanged.
  • No new spec requirement needed (purely an implementation perf improvement; observable contract unchanged).
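One possible shape for the synthetic perf test from the acceptance list, sketched under assumptions: make_synthetic, build_trajectories and time_once are hypothetical names, the columns are illustrative, and build_trajectories is a stand-in for whatever the pipeline ends up using. No time budget is asserted here; a real test would pick a threshold from the measured benchmark.

```python
import time

import numpy as np
import pandas as pd

# Hypothetical columns; the real pipeline schema may differ.
COLUMNS = ["series", "frame", "track_id", "x", "y"]


def make_synthetic(n_series=10, n_frames=100, n_tracks=5, seed=0):
    """Build one row per (series, frame, track) combination with random x/y."""
    rng = np.random.default_rng(seed)
    series, frame, track = np.meshgrid(
        np.arange(n_series), np.arange(n_frames), np.arange(n_tracks), indexing="ij"
    )
    n_rows = series.size
    return pd.DataFrame(
        {
            "series": series.ravel(),
            "frame": frame.ravel(),
            "track_id": track.ravel(),
            "x": rng.random(n_rows),
            "y": rng.random(n_rows),
        }
    )


def build_trajectories(df):
    """Stand-in for the pipeline's trajectory build (zip-over-columns variant)."""
    columns = [df[name].to_numpy().tolist() for name in COLUMNS]
    return [dict(zip(COLUMNS, values)) for values in zip(*columns)]


def time_once(fn, *args):
    """Run fn once and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


df = make_synthetic()  # 10 series x 100 frames x 5 tracks = 5,000 rows
trajectories, elapsed = time_once(build_trajectories, df)
print(f"{len(trajectories)} rows in {elapsed:.4f} s")
```

Scaling n_series, n_frames and n_tracks up in the same helper would allow the same test to probe sizes closer to the 10M-row scenario without needing real plate data.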

Related
