PR 26 of #508 — device-agnostic layer buffers + Linux spawn-context for PAF pool#563
Merged
gitttt-1234 merged 3 commits intoMay 28, 2026
Conversation
…or PAF pool Bench-surfaced bug fixes from the CUDA bench (#560 prereq) + a new MPS/CUDA regression test. Closes the "topdown predict_streaming fails with device mismatch" and "paf_workers>0 deadlocks on Linux+CUDA" issues found while benchmarking PR 25's tip on an NVIDIA A40 box. ## Root causes Several `InferenceLayer` subclasses allocated output buffers with bare `torch.full((...), float("nan"))` / `torch.ones(B)` calls — no `device=` kwarg. On CPU this is silent (everything is CPU); on **any non-CPU device** the scatter from the layer's device-resident tensors into a CPU buffer raises `RuntimeError: Expected all tensors to be on the same device`. Reproduces on MPS (Mac M-series) with the exact same shape of error as the CUDA bench reported, so the bug is **non-CPU-device-path** in scope, not CUDA-specific. Separately, `PafGroupingPool` constructs its `ProcessPoolExecutor` without an explicit `mp_context`. On Linux this defaults to **fork**, which inherits the parent's already-initialized CUDA context and deadlocks the first worker call. The fix pins `mp_context=multiprocessing.get_context("spawn")`, matching the existing default on macOS / Windows. ## Files * `sleap_nn/inference/layers/topdown.py` — 3 `torch.full` allocations now pass `device=stage2_kpts_img.device` (the working scatter source). * `sleap_nn/inference/layers/centroid.py` — `padded_peaks`, `padded_vals`, `centroid_vals`, and both `PreprocInfo.eff_scale` allocations now device-aware in both the GT branch and the postprocess. * `sleap_nn/inference/layers/centered_instance.py` — `b_idx`, `matched_vals`, `pred_centroid_values`, and `eff_scale` device-aware. * `sleap_nn/inference/layers/single_instance.py` — `eff_scale` device-aware (uses `x.device` since this layer has no `scaled` variable in scope). * `sleap_nn/inference/layers/bottomup.py`, `sleap_nn/inference/layers/bottomup_multiclass.py`, `sleap_nn/inference/layers/topdown_multiclass.py` — `eff_scale` device-aware via `scaled.device`. * `sleap_nn/inference/streaming.py` — `PafGroupingPool.__enter__` pins the `spawn` start method explicitly. Docstring updated. * `tests/inference/test_e2e_video.py` (new, 10 tests = 5 CPU + 5 MPS-gated): real fixture ckpt → `VideoProvider(small_robot.mp4)` → `predict_streaming()` for every supported model type. Pre-fix the MPS `topdown` case raised the device-mismatch error; post-fix all 10 pass. ## Why the existing test suite missed these Every pre-existing inference test either (a) used `_StubLayer` instead of a real backend, (b) used `NumpyProvider` with synthetic frames, or (c) mocked the factory. None exercised the actual `video → preprocess → backend.forward → postprocess → Outputs` chain on a real fixture. The new `tests/inference/test_e2e_video.py` plugs that gap. ## Out of scope The CUDA bench also showed two **channel-mismatch** failures (centroid-only + bottom-up `predict_streaming` on real video — both reporting `weight=[36, 72, 3, 3], expected 72 channels, got 36`). These reproduce **only on CUDA** (clean on CPU and MPS with the same code + checkpoint + video). Probably cuDNN strictness or a torch 2.9.1 + non-square input interaction with UNet skip connections. Need CUDA hardware to fix; will file as a separate issue with the bench traceback attached. ## Tests ``` tests/inference/test_e2e_video.py 10 passed (5 CPU + 5 MPS) tests/inference/test_paf_worker_pool.py 8 passed (spawn-context fix intact, no regressions) tests/inference/ + cli/ + test_instance_centroids 414 passed, 23 skipped (CUDA-gated) black --check sleap_nn tests clean ruff check sleap_nn/ clean ``` Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## divya/inf-refactor-25-centroid-only-inference #563 +/- ##
=================================================================================
+ Coverage 64.20% 64.46% +0.26%
=================================================================================
Files 124 125 +1
Lines 19066 19431 +365
=================================================================================
+ Hits 12241 12527 +286
- Misses 6825 6904 +79 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
6 tasks
…#564) Stacked on #563 (PR 26). Closes the parity gaps surfaced by the post-bench audit at `scratch/2026-04-30-inference-refactor-implementation/parity_audit/parity_report.md`. ## Why this PR exists The CUDA bench (#560) and subsequent audit revealed that the new inference flow was **silently producing wrong outputs** on any video not coincidentally matching the model's training dimensions. The PR-0 parity goldens didn't catch it because they covered the wrong slice — pinning model-forward parity (give the model the same preprocessed input, get the same output) instead of pipeline parity (raw video → preprocess → forward → postprocess → final keypoints). The audit catalogued **10 numbered divergences** between legacy `sleap_nn.inference.predictors.Predictor.from_model_paths` and new `sleap_nn.inference.factory.from_model_paths`. This PR closes all of them. ## What's in the PR (3 commits) ### 1. `PR 4 of #508 (deferred): shared full-preprocess helper across InferenceLayers` `InferenceLayer._apply_full_preprocess(x, max_stride, unsqueeze_n_samples)` runs the legacy chain in order: 1. `ensure_rgb` / `ensure_grayscale` (channel coercion) 2. Per-sample `apply_sizematcher` to `(max_h, max_w)` → produces `eff_scale` tensor 3. `resize_image` by `preprocess_config.scale` (input_scale) 4. `apply_pad_to_stride` to `max_stride` 5. `unsqueeze(dim=1)` for the n_samples Lightning-forward contract Every raw-frame layer's `preprocess()` delegates to it: `SingleInstance`, `Centroid`, `CenteredInstance`, `BottomUp`, `BottomUpMultiClass`, `TopDownMultiClass`. Each step short-circuits when its config field is the identity. `SingleInstanceLayer.__init__` gains a `max_stride` arg (was missing). ### 2. `PR 8 + 11 of #508 (deferred): factory forwards preprocess fields; .slp ingestion works; uint8 preserved` Three fixes: - **Factory wiring** — `factory.from_model_paths` now reads `preprocess_config.{max_height, max_width, ensure_rgb, ensure_grayscale}` off the legacy predictor (which resolves them from `training_config.yaml`) and threads them into every layer's `PreprocessConfig`. Centroid layers get the sizematcher fields; centered-instance layers in topdown composition intentionally don't (they receive per-instance crops, not raw frames — sizematcher there would upsize the crops). - **uint8 preservation** — split `_to_4d_float_tensor` into `_to_4d_tensor` (layout only, dtype-preserving) + `_to_4d_float_tensor` (thin float wrapper for backward compat). Every layer's `preprocess()` uses `_to_4d_tensor` so uint8 stays uint8 through `tvf.resize`. The eager `.float()` was producing `255.00006...` after resize, off-by-noise from legacy's clean uint8 path. `normalize_on_gpu` inside the Lightning forward handles uint8→float32 conversion. - **`Predictor._batch_iter` instances kwarg** — only forwards `batch.instances` to layers whose `predict` signature accepts the kwarg (via `inspect.signature`). Pre-fix, `.slp` ingestion raised `TypeError` on every layer except centroid/topdown. ### 3. `PR 27 of #508: topdown crops from sized image; permanent parity-vs-legacy test` - **TopDownLayer crops from the sized image** (post-sizematcher), not the raw frame. Legacy `CentroidCrop` extracts `crop_hw` crops from the sized image; the centered_instance model was trained on those sized-space crops. The new flow was extracting crops from the raw frame, producing crops covering a slightly different physical region (96×96 raw pixels vs 96×96 sized pixels ≈ 140×140 raw pixels when `eff_scale=0.686`). Median drift on topdown × small_robot.mp4 was ~15 px. `TopDownLayer.predict` now re-applies the centroid layer's sizematcher (via `_sizematch_like_centroid_layer`) to recover the sized image + per-sample `eff_scale`, converts `centroids` back to sized space for bbox construction, runs stage 2 in sized space, then divides the final keypoints + bboxes by `eff_scale` to land in original-image space. - **`tests/inference/test_parity_vs_legacy.py`** — permanent guardrail. 6 parametrized tests asserting final-keypoint parity between legacy and new `Predictor` on every fixture × `{small_robot.mp4, minimal_instance.pkg.slp}` within `atol/rtol=1e-4`. - **`tests/inference/layers/test_topdown.py::test_centroid_nms_dedupes_close_centroids`** updated to stub the new `preprocess_config` + `_to_4d_tensor` attributes on its `CentroidLayer.__new__(...)` mock. ## Final parity results | fixture × source | model-input parity | final-keypoint parity | |---|---|---| | single_instance × small_robot.mp4 | ✓ identical | ✅ 0.0000 px (strict) | | single_instance × minimal_instance.pkg.slp | ✓ identical | ✅ 0.0000 px (strict) | | topdown × small_robot.mp4 | ✓ identical | ✅ 0.0001 px (strict) | | topdown × minimal_instance.pkg.slp | ✓ both stages | ✅ 0.0000 px (strict) | | bottomup × small_robot.mp4 | ✓ identical | ✅ 0.0000 px (strict) | | bottomup × minimal_instance.pkg.slp | ✓ identical | ✅ 0.0000 px (strict) | Pre-PR-27 the same audit showed: - `single_instance × small_robot.mp4`: input shape `(4,3,320,560)` vs legacy `(4,1,3,160,280)` (no input_scale, no n_samples wrap) - `topdown / bottomup × small_robot.mp4`: input mean **53 vs 93** (sizematcher missing entirely) - `topdown × small_robot.mp4`: final keypoints **41.8 px max nearest-neighbour drift** between flows - `.slp` ingestion: `TypeError: InferenceLayer.predict() got unexpected keyword argument 'instances'` on every non-centroid layer ## How this happened The PR-0 goldens captured the model's input + output from the legacy flow's `InferenceModel.forward`. The new layer tests then asserted that, given the same model input, the layer produces the same model output. That's model-forward parity; it doesn't exercise the preprocessing chain (sizematcher / channel coercion / dtype / n_samples wrap). Because the goldens were the only acceptance gate, every PR in the stack passed "parity within 1e-5" while the preprocessing in the new flow was silently incomplete. The first time real video frames entered through `VideoProvider`, the divergence surfaced — visible in the CUDA bench as a `RuntimeError` (channel mismatch on cuDNN) and on Mac CPU/MPS as silently wrong predictions. The new `tests/inference/test_parity_vs_legacy.py` is the gate that should have existed since PR 0. It exercises the full `from_model_paths(ckpt).predict(source)` pipeline and compares final keypoints against legacy. ## Test plan - [x] `pytest tests/inference/test_parity_vs_legacy.py` — 6 passed (0.0000–0.0001 px max diff). - [x] `pytest tests/inference/ tests/cli/ tests/data/test_instance_centroids.py` — 418 passed, 23 skipped (CUDA-gated), 1 xfailed (PR-0 single-instance golden test, marked xfail with note pointing here). - [x] `pytest tests/inference/layers/test_topdown.py::test_centroid_nms_dedupes_close_centroids` — passes after stubbing. - [x] `black --check sleap_nn tests` — clean. - [x] `ruff check sleap_nn/` — clean. - [ ] Re-run CUDA bench on the A40 box. Section C centroid + bottomup channel-mismatch errors are expected to clear (they were sizematcher in disguise). ## Out of scope - The xfailed `test_single_instance_layer_parity_vs_pr0_golden` test was written against the old Option-B contract (caller pre-preprocesses, layer.preprocess is a no-op). PR 27 moves the layer to Option-A (layer.predict(raw_frame) does the full pipeline) so feeding pre-scaled input now double-scales. The new `test_parity_vs_legacy.py` supersedes it as the parity guardrail. - ONNX `Exported*Layer` adapters were not audited and likely have the same anti-pattern (they bypass `_apply_full_preprocess`). Separate follow-up after #560 reruns confirm the in-flow path is correct. 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
cca171d
into
divya/inf-refactor-25-centroid-only-inference
6 checks passed
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Stacked on #562 (PR 25). Closes two bug families surfaced by the CUDA bench against PR 25's tip on an NVIDIA A40 box.
What the bench surfaced
predict_streamingdevice mismatchcuda:0 vs cpuandmps:0 vs cpupaf_workers > 0hangs foreverThe first one I'd assumed was CUDA-specific. Testing on Mac MPS reproduced it with the exact same shape of error — confirming the root cause is device-agnostic anti-pattern in the new flow, not a CUDA quirk.
Root cause 1 — CPU-allocated output buffers across the layers
Several
InferenceLayersubclasses allocated output buffers with baretorch.full((...), float("nan"))/torch.ones(B)calls. Nodevice=kwarg. On CPU this is silent (everything is CPU); on any non-CPU device the scatter from the layer's device-resident tensors into a CPU buffer raisesRuntimeError: Expected all tensors to be on the same device.Files touched (the same anti-pattern fixed everywhere):
sleap_nn/inference/layers/topdown.py— 3torch.fullallocations now device-aware (stage2_kpts_img.device).sleap_nn/inference/layers/centroid.py—padded_peaks,padded_vals,centroid_vals, botheff_scaleallocations now device-aware in both the GT branch and the postprocess. This was the actual upstream source of the topdown failure — the centroid layer was returning CPU-residentpred_centroidseven when running on MPS/CUDA, which then poisoned the scatter in topdown stage 2.sleap_nn/inference/layers/centered_instance.py—b_idx,matched_vals,pred_centroid_values,eff_scaledevice-aware.sleap_nn/inference/layers/single_instance.py—eff_scaledevice-aware (usesx.device).sleap_nn/inference/layers/{bottomup, bottomup_multiclass, topdown_multiclass}.py—eff_scaledevice-aware viascaled.device.Root cause 2 —
PafGroupingPoolfork-on-Linux deadlocks against CUDAsleap_nn/inference/streaming.py:328constructed itsProcessPoolExecutorwithout an explicitmp_context. On Linux this defaults to fork, which inherits the parent's already-initialized CUDA context and deadlocks the first worker call. Pinnedmp_context=multiprocessing.get_context(\"spawn\")— matches the existing default on macOS / Windows.Why the existing test suite missed all of this
Every pre-existing inference test either (a) used
_StubLayerinstead of a real backend, (b) usedNumpyProviderwith synthetic frames, or (c) mocked the factory. None exercised the actualvideo → preprocess → backend.forward → postprocess → Outputschain on a real fixture. The newtests/inference/test_e2e_video.pyplugs that gap.10 parametrized tests = 5 CPU + 5 MPS-gated, one per supported model type (single_instance, centroid_only, topdown, bottomup, multiclass_bottomup). Pre-fix the MPS topdown case raised the device-mismatch error; post-fix all 10 pass.
Out of scope for this PR
The CUDA bench also showed channel-mismatch failures for centroid-only + bottom-up
predict_streamingon a real video (weight=[36, 72, 3, 3], expected 72 channels, got 36). These reproduce only on CUDA — clean on CPU and MPS with the same code + checkpoint + video. Probably cuDNN strictness or a torch 2.9.1 quirk with UNet skip-connection alignment on non-square inputs. Need a CUDA box to fix; filing as a separate follow-up issue.The standalone
centered_instancefactory dispatch (raisedUnsupported model_paths combinationin the bench) is also intentionally out of scope — it's a missing feature symmetric to PR 25's centroid-only support, with a similar footprint. Will be filed as PR 27.Test plan
pytest tests/inference/test_e2e_video.py— 10 passed (5 CPU + 5 MPS) on this Mac.pytest tests/inference/test_paf_worker_pool.py— 8 passed (spawn fix intact, no regressions).pytest tests/inference/ tests/cli/ tests/data/test_instance_centroids.py— 414 passed, 23 skipped (CUDA-gated).black --check sleap_nn tests— clean.ruff check sleap_nn/— clean.🤖 Generated with Claude Code