PR 26 of #508 — device-agnostic layer buffers + Linux spawn-context for PAF pool by gitttt-1234 · Pull Request #563 · talmolab/sleap-nn

gitttt-1234 · 2026-05-14T23:14:34Z

Stacked on #562 (PR 25). Closes two bug families surfaced by the CUDA bench against PR 25's tip on an NVIDIA A40 box.

What the bench surfaced

Bug	Where	Reproduces on
Top-down `predict_streaming` device mismatch	`cuda:0 vs cpu` and `mps:0 vs cpu`	CUDA + MPS (not CPU)
`paf_workers > 0` hangs forever	first worker call	Linux + CUDA (fork start-method default)

The first one I'd assumed was CUDA-specific. Testing on Mac MPS reproduced it with the exact same shape of error, confirming the root cause is a device-agnostic anti-pattern in the new flow, not a CUDA quirk.

Root cause 1 — CPU-allocated output buffers across the layers

Several InferenceLayer subclasses allocated output buffers with bare torch.full((...), float("nan")) / torch.ones(B) calls. No device= kwarg. On CPU this is silent (everything is CPU); on any non-CPU device the scatter from the layer's device-resident tensors into a CPU buffer raises RuntimeError: Expected all tensors to be on the same device.

Files touched (the same anti-pattern fixed everywhere):

sleap_nn/inference/layers/topdown.py — 3 torch.full allocations now device-aware (stage2_kpts_img.device).
sleap_nn/inference/layers/centroid.py — padded_peaks, padded_vals, centroid_vals, both eff_scale allocations now device-aware in both the GT branch and the postprocess. This was the actual upstream source of the topdown failure — the centroid layer was returning CPU-resident pred_centroids even when running on MPS/CUDA, which then poisoned the scatter in topdown stage 2.
sleap_nn/inference/layers/centered_instance.py — b_idx, matched_vals, pred_centroid_values, eff_scale device-aware.
sleap_nn/inference/layers/single_instance.py — eff_scale device-aware (uses x.device).
sleap_nn/inference/layers/{bottomup, bottomup_multiclass, topdown_multiclass}.py — eff_scale device-aware via scaled.device.
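A minimal sketch of the device-aware allocation pattern described above, assuming PyTorch is installed. The function name `scatter_into_nan_buffer` and its arguments are hypothetical and are not taken from the sleap-nn source; the point is only that the buffer inherits `device` (and `dtype`) from the tensor that will be scattered into it.

```python
import torch


def scatter_into_nan_buffer(
    values: torch.Tensor, rows: torch.Tensor, n_rows: int
) -> torch.Tensor:
    """Scatter `values` of shape (K, D) into a NaN-filled (n_rows, D) buffer.

    Hypothetical helper for illustration only. The buffer is allocated on
    `values.device`, so the indexed assignment below never mixes devices.
    """
    out = torch.full(
        (n_rows, values.shape[1]),
        float("nan"),
        device=values.device,  # omitting this allocates on CPU by default
        dtype=values.dtype,
    )
    # The index tensor must live on the same device as the buffer as well.
    out[rows.to(values.device)] = values
    return out


values = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
rows = torch.tensor([0, 2])
buf = scatter_into_nan_buffer(values, rows, n_rows=3)
```

On CPU the bare `torch.full(...)` form behaves identically, which is why the omission only shows up once `values` lives on CUDA or MPS.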

Root cause 2 — `PafGroupingPool` fork-on-Linux deadlocks against CUDA

sleap_nn/inference/streaming.py:328 constructed its ProcessPoolExecutor without an explicit mp_context. On Linux this defaults to fork, which inherits the parent's already-initialized CUDA context and deadlocks the first worker call. The fix pins mp_context=multiprocessing.get_context("spawn"), matching the existing default on macOS / Windows.

Why the existing test suite missed all of this

Every pre-existing inference test either (a) used _StubLayer instead of a real backend, (b) used NumpyProvider with synthetic frames, or (c) mocked the factory. None exercised the actual video → preprocess → backend.forward → postprocess → Outputs chain on a real fixture. The new tests/inference/test_e2e_video.py plugs that gap.

10 parametrized tests = 5 CPU + 5 MPS-gated, one per supported model type (single_instance, centroid_only, topdown, bottomup, multiclass_bottomup). Pre-fix the MPS topdown case raised the device-mismatch error; post-fix all 10 pass.
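One way to build such a model-type by device matrix is sketched below, assuming PyTorch is installed. This is not the contents of `test_e2e_video.py`: `available_devices` and `case_matrix` are hypothetical helpers, and the sketch drops unavailable devices from the matrix rather than skip-marking them.

```python
import torch

# Model types listed in the PR description.
MODEL_TYPES = [
    "single_instance",
    "centroid_only",
    "topdown",
    "bottomup",
    "multiclass_bottomup",
]


def available_devices() -> list:
    """Return 'cpu' plus any accelerator backends usable on this machine."""
    devices = ["cpu"]
    if torch.cuda.is_available():
        devices.append("cuda")
    # Guard the attribute lookup in case the build has no MPS backend module.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        devices.append("mps")
    return devices


def case_matrix() -> list:
    """Cross every model type with every available device."""
    return [(m, d) for m in MODEL_TYPES for d in available_devices()]
```

A list like `case_matrix()` can be passed to `pytest.mark.parametrize("model_type,device", ...)` so each case runs the full video-to-outputs chain on a real device.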

Out of scope for this PR

The CUDA bench also showed channel-mismatch failures for centroid-only + bottom-up predict_streaming on a real video (weight=[36, 72, 3, 3], expected 72 channels, got 36). These reproduce only on CUDA — clean on CPU and MPS with the same code + checkpoint + video. Probably cuDNN strictness or a torch 2.9.1 quirk with UNet skip-connection alignment on non-square inputs. Need a CUDA box to fix; filing as a separate follow-up issue.

The standalone centered_instance factory dispatch (raised Unsupported model_paths combination in the bench) is also intentionally out of scope — it's a missing feature symmetric to PR 25's centroid-only support, with a similar footprint. Will be filed as PR 27.

Test plan

pytest tests/inference/test_e2e_video.py — 10 passed (5 CPU + 5 MPS) on this Mac.
pytest tests/inference/test_paf_worker_pool.py — 8 passed (spawn fix intact, no regressions).
pytest tests/inference/ tests/cli/ tests/data/test_instance_centroids.py — 414 passed, 23 skipped (CUDA-gated).
black --check sleap_nn tests — clean.
ruff check sleap_nn/ — clean.
Re-run the CUDA bench on the A40 box (section D + topdown C should both now pass; centroid + bottomup channel mismatches will persist — that's the separate issue).

🤖 Generated with Claude Code

…or PAF pool

Bench-surfaced bug fixes from the CUDA bench (#560 prereq) + a new MPS/CUDA regression test. Closes the "topdown predict_streaming fails with device mismatch" and "paf_workers>0 deadlocks on Linux+CUDA" issues found while benchmarking PR 25's tip on an NVIDIA A40 box.

## Root causes

Several `InferenceLayer` subclasses allocated output buffers with bare `torch.full((...), float("nan"))` / `torch.ones(B)` calls, with no `device=` kwarg. On CPU this is silent (everything is CPU); on **any non-CPU device** the scatter from the layer's device-resident tensors into a CPU buffer raises `RuntimeError: Expected all tensors to be on the same device`. Reproduces on MPS (Mac M-series) with the exact same shape of error as the CUDA bench reported, so the bug is **non-CPU-device-path** in scope, not CUDA-specific.

Separately, `PafGroupingPool` constructs its `ProcessPoolExecutor` without an explicit `mp_context`. On Linux this defaults to **fork**, which inherits the parent's already-initialized CUDA context and deadlocks the first worker call. The fix pins `mp_context=multiprocessing.get_context("spawn")`, matching the existing default on macOS / Windows.

## Files

* `sleap_nn/inference/layers/topdown.py`: 3 `torch.full` allocations now pass `device=stage2_kpts_img.device` (the working scatter source).
* `sleap_nn/inference/layers/centroid.py`: `padded_peaks`, `padded_vals`, `centroid_vals`, and both `PreprocInfo.eff_scale` allocations now device-aware in both the GT branch and the postprocess.
* `sleap_nn/inference/layers/centered_instance.py`: `b_idx`, `matched_vals`, `pred_centroid_values`, and `eff_scale` device-aware.
* `sleap_nn/inference/layers/single_instance.py`: `eff_scale` device-aware (uses `x.device` since this layer has no `scaled` variable in scope).
* `sleap_nn/inference/layers/bottomup.py`, `sleap_nn/inference/layers/bottomup_multiclass.py`, `sleap_nn/inference/layers/topdown_multiclass.py`: `eff_scale` device-aware via `scaled.device`.
* `sleap_nn/inference/streaming.py`: `PafGroupingPool.__enter__` pins the `spawn` start method explicitly. Docstring updated.
* `tests/inference/test_e2e_video.py` (new, 10 tests = 5 CPU + 5 MPS-gated): real fixture ckpt → `VideoProvider(small_robot.mp4)` → `predict_streaming()` for every supported model type. Pre-fix the MPS `topdown` case raised the device-mismatch error; post-fix all 10 pass.

## Why the existing test suite missed these

Every pre-existing inference test either (a) used `_StubLayer` instead of a real backend, (b) used `NumpyProvider` with synthetic frames, or (c) mocked the factory. None exercised the actual `video → preprocess → backend.forward → postprocess → Outputs` chain on a real fixture. The new `tests/inference/test_e2e_video.py` plugs that gap.

## Out of scope

The CUDA bench also showed two **channel-mismatch** failures (centroid-only + bottom-up `predict_streaming` on real video, both reporting `weight=[36, 72, 3, 3], expected 72 channels, got 36`). These reproduce **only on CUDA** (clean on CPU and MPS with the same code + checkpoint + video). Probably cuDNN strictness or a torch 2.9.1 + non-square input interaction with UNet skip connections. Need CUDA hardware to fix; will file as a separate issue with the bench traceback attached.

## Tests

```
tests/inference/test_e2e_video.py                    10 passed (5 CPU + 5 MPS)
tests/inference/test_paf_worker_pool.py              8 passed (spawn-context fix intact, no regressions)
tests/inference/ + cli/ + test_instance_centroids    414 passed, 23 skipped (CUDA-gated)
black --check sleap_nn tests                         clean
ruff check sleap_nn/                                 clean
```

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

codecov · 2026-05-14T23:19:12Z

Codecov Report

❌ Patch coverage is 68.41398% with 235 lines in your changes missing coverage. Please review.
✅ Project coverage is 64.46%. Comparing base (60575f1) to head (bff2525).

Files with missing lines	Patch %	Lines
sleap_nn/inference/predictor.py	48.37%	143 Missing ⚠️
sleap_nn/inference/run.py	14.28%	30 Missing ⚠️
sleap_nn/inference/loaders.py	88.26%	25 Missing ⚠️
sleap_nn/inference/layers/base.py	68.18%	21 Missing ⚠️
sleap_nn/cli.py	89.65%	6 Missing ⚠️
sleap_nn/inference/layers/centroid.py	70.58%	5 Missing ⚠️
sleap_nn/inference/layers/topdown.py	88.88%	5 Missing ⚠️

Additional details and impacted files

@@                                Coverage Diff                                @@
##           divya/inf-refactor-25-centroid-only-inference     #563      +/-   ##
=================================================================================
+ Coverage                                          64.20%   64.46%   +0.26%     
=================================================================================
  Files                                                124      125       +1     
  Lines                                              19066    19431     +365     
=================================================================================
+ Hits                                               12241    12527     +286     
- Misses                                              6825     6904      +79


…#564)

Stacked on #563 (PR 26). Closes the parity gaps surfaced by the post-bench audit at `scratch/2026-04-30-inference-refactor-implementation/parity_audit/parity_report.md`.

## Why this PR exists

The CUDA bench (#560) and subsequent audit revealed that the new inference flow was **silently producing wrong outputs** on any video not coincidentally matching the model's training dimensions. The PR-0 parity goldens didn't catch it because they covered the wrong slice: they pinned model-forward parity (give the model the same preprocessed input, get the same output) instead of pipeline parity (raw video → preprocess → forward → postprocess → final keypoints).

The audit catalogued **10 numbered divergences** between legacy `sleap_nn.inference.predictors.Predictor.from_model_paths` and new `sleap_nn.inference.factory.from_model_paths`. This PR closes all of them.

## What's in the PR (3 commits)

### 1. `PR 4 of #508 (deferred): shared full-preprocess helper across InferenceLayers`

`InferenceLayer._apply_full_preprocess(x, max_stride, unsqueeze_n_samples)` runs the legacy chain in order:

1. `ensure_rgb` / `ensure_grayscale` (channel coercion)
2. Per-sample `apply_sizematcher` to `(max_h, max_w)` → produces `eff_scale` tensor
3. `resize_image` by `preprocess_config.scale` (input_scale)
4. `apply_pad_to_stride` to `max_stride`
5. `unsqueeze(dim=1)` for the n_samples Lightning-forward contract

Every raw-frame layer's `preprocess()` delegates to it: `SingleInstance`, `Centroid`, `CenteredInstance`, `BottomUp`, `BottomUpMultiClass`, `TopDownMultiClass`. Each step short-circuits when its config field is the identity. `SingleInstanceLayer.__init__` gains a `max_stride` arg (was missing).

### 2. `PR 8 + 11 of #508 (deferred): factory forwards preprocess fields; .slp ingestion works; uint8 preserved`

Three fixes:

- **Factory wiring**: `factory.from_model_paths` now reads `preprocess_config.{max_height, max_width, ensure_rgb, ensure_grayscale}` off the legacy predictor (which resolves them from `training_config.yaml`) and threads them into every layer's `PreprocessConfig`. Centroid layers get the sizematcher fields; centered-instance layers in topdown composition intentionally don't (they receive per-instance crops, not raw frames, so sizematcher there would upsize the crops).
- **uint8 preservation**: split `_to_4d_float_tensor` into `_to_4d_tensor` (layout only, dtype-preserving) + `_to_4d_float_tensor` (thin float wrapper for backward compat). Every layer's `preprocess()` uses `_to_4d_tensor` so uint8 stays uint8 through `tvf.resize`. The eager `.float()` was producing `255.00006...` after resize, off-by-noise from legacy's clean uint8 path. `normalize_on_gpu` inside the Lightning forward handles uint8→float32 conversion.
- **`Predictor._batch_iter` instances kwarg**: only forwards `batch.instances` to layers whose `predict` signature accepts the kwarg (via `inspect.signature`). Pre-fix, `.slp` ingestion raised `TypeError` on every layer except centroid/topdown.

### 3. `PR 27 of #508: topdown crops from sized image; permanent parity-vs-legacy test`

- **TopDownLayer crops from the sized image** (post-sizematcher), not the raw frame. Legacy `CentroidCrop` extracts `crop_hw` crops from the sized image; the centered_instance model was trained on those sized-space crops. The new flow was extracting crops from the raw frame, producing crops covering a slightly different physical region (96×96 raw pixels vs 96×96 sized pixels ≈ 140×140 raw pixels when `eff_scale=0.686`). Median drift on topdown × small_robot.mp4 was ~15 px. `TopDownLayer.predict` now re-applies the centroid layer's sizematcher (via `_sizematch_like_centroid_layer`) to recover the sized image + per-sample `eff_scale`, converts `centroids` back to sized space for bbox construction, runs stage 2 in sized space, then divides the final keypoints + bboxes by `eff_scale` to land in original-image space.
- **`tests/inference/test_parity_vs_legacy.py`**: permanent guardrail. 6 parametrized tests asserting final-keypoint parity between legacy and new `Predictor` on every fixture × `{small_robot.mp4, minimal_instance.pkg.slp}` within `atol/rtol=1e-4`.
- **`tests/inference/layers/test_topdown.py::test_centroid_nms_dedupes_close_centroids`** updated to stub the new `preprocess_config` + `_to_4d_tensor` attributes on its `CentroidLayer.__new__(...)` mock.

## Final parity results

| fixture × source | model-input parity | final-keypoint parity |
|---|---|---|
| single_instance × small_robot.mp4 | ✓ identical | ✅ 0.0000 px (strict) |
| single_instance × minimal_instance.pkg.slp | ✓ identical | ✅ 0.0000 px (strict) |
| topdown × small_robot.mp4 | ✓ identical | ✅ 0.0001 px (strict) |
| topdown × minimal_instance.pkg.slp | ✓ both stages | ✅ 0.0000 px (strict) |
| bottomup × small_robot.mp4 | ✓ identical | ✅ 0.0000 px (strict) |
| bottomup × minimal_instance.pkg.slp | ✓ identical | ✅ 0.0000 px (strict) |

Pre-PR-27 the same audit showed:

- `single_instance × small_robot.mp4`: input shape `(4,3,320,560)` vs legacy `(4,1,3,160,280)` (no input_scale, no n_samples wrap)
- `topdown / bottomup × small_robot.mp4`: input mean **53 vs 93** (sizematcher missing entirely)
- `topdown × small_robot.mp4`: final keypoints **41.8 px max nearest-neighbour drift** between flows
- `.slp` ingestion: `TypeError: InferenceLayer.predict() got unexpected keyword argument 'instances'` on every non-centroid layer

## How this happened

The PR-0 goldens captured the model's input + output from the legacy flow's `InferenceModel.forward`. The new layer tests then asserted that, given the same model input, the layer produces the same model output. That's model-forward parity; it doesn't exercise the preprocessing chain (sizematcher / channel coercion / dtype / n_samples wrap).

Because the goldens were the only acceptance gate, every PR in the stack passed "parity within 1e-5" while the preprocessing in the new flow was silently incomplete. The first time real video frames entered through `VideoProvider`, the divergence surfaced: visible in the CUDA bench as a `RuntimeError` (channel mismatch on cuDNN) and on Mac CPU/MPS as silently wrong predictions.

The new `tests/inference/test_parity_vs_legacy.py` is the gate that should have existed since PR 0. It exercises the full `from_model_paths(ckpt).predict(source)` pipeline and compares final keypoints against legacy.

## Test plan

- [x] `pytest tests/inference/test_parity_vs_legacy.py`: 6 passed (0.0000–0.0001 px max diff).
- [x] `pytest tests/inference/ tests/cli/ tests/data/test_instance_centroids.py`: 418 passed, 23 skipped (CUDA-gated), 1 xfailed (PR-0 single-instance golden test, marked xfail with note pointing here).
- [x] `pytest tests/inference/layers/test_topdown.py::test_centroid_nms_dedupes_close_centroids`: passes after stubbing.
- [x] `black --check sleap_nn tests`: clean.
- [x] `ruff check sleap_nn/`: clean.
- [ ] Re-run CUDA bench on the A40 box. Section C centroid + bottomup channel-mismatch errors are expected to clear (they were sizematcher in disguise).

## Out of scope

- The xfailed `test_single_instance_layer_parity_vs_pr0_golden` test was written against the old Option-B contract (caller pre-preprocesses, layer.preprocess is a no-op). PR 27 moves the layer to Option-A (layer.predict(raw_frame) does the full pipeline) so feeding pre-scaled input now double-scales. The new `test_parity_vs_legacy.py` supersedes it as the parity guardrail.
- ONNX `Exported*Layer` adapters were not audited and likely have the same anti-pattern (they bypass `_apply_full_preprocess`). Separate follow-up after #560 reruns confirm the in-flow path is correct.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
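The `Predictor._batch_iter` fix in the commit message above forwards `instances` only to layers whose `predict` accepts it, via `inspect.signature`. A standard-library sketch of that idea follows; `accepts_kwarg`, `call_predict`, and the two toy layer classes are hypothetical and do not mirror the real sleap-nn signatures.

```python
import inspect
from typing import Any, Callable, Optional


def accepts_kwarg(fn: Callable[..., Any], name: str) -> bool:
    """Return True if `fn` can be called with `name=` as a keyword argument."""
    params = inspect.signature(fn).parameters
    named = params.get(name)
    if named is not None and named.kind in (
        inspect.Parameter.POSITIONAL_OR_KEYWORD,
        inspect.Parameter.KEYWORD_ONLY,
    ):
        return True
    # A **kwargs catch-all also accepts the keyword.
    return any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values())


def call_predict(layer: Any, frames: Any, instances: Optional[Any] = None) -> Any:
    """Forward `instances` only when the layer's predict() can take it."""
    kwargs = {}
    if instances is not None and accepts_kwarg(layer.predict, "instances"):
        kwargs["instances"] = instances
    return layer.predict(frames, **kwargs)


class LayerWithInstances:
    # Toy stand-in for a layer that consumes ground-truth instances.
    def predict(self, frames, instances=None):
        return ("with", frames, instances)


class LayerWithoutInstances:
    # Toy stand-in for a layer whose predict() has no `instances` parameter.
    def predict(self, frames):
        return ("without", frames)
```

Passing `instances=` unconditionally to `LayerWithoutInstances().predict` would raise `TypeError`, which is the failure mode reported for `.slp` ingestion before the fix.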

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

gitttt-1234 mentioned this pull request May 15, 2026

PR 27 of #508 — preprocessing parity with legacy (audit-driven fixes) #564

Merged

6 tasks

gitttt-1234 marked this pull request as ready for review May 28, 2026 17:36

fix: remove Mac skip on training tests

bff2525

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

gitttt-1234 merged commit cca171d into divya/inf-refactor-25-centroid-only-inference May 28, 2026
6 checks passed

gitttt-1234 deleted the divya/inf-refactor-26-bench-fixes branch May 28, 2026 17:54

gitttt-1234 merged 3 commits into divya/inf-refactor-25-centroid-only-inference from divya/inf-refactor-26-bench-fixes
