Skip to content

PR 26 of #508 — device-agnostic layer buffers + Linux spawn-context for PAF pool#563

Merged
gitttt-1234 merged 3 commits into
divya/inf-refactor-25-centroid-only-inferencefrom
divya/inf-refactor-26-bench-fixes
May 28, 2026
Merged

PR 26 of #508 — device-agnostic layer buffers + Linux spawn-context for PAF pool#563
gitttt-1234 merged 3 commits into
divya/inf-refactor-25-centroid-only-inferencefrom
divya/inf-refactor-26-bench-fixes

Conversation

@gitttt-1234

Copy link
Copy Markdown
Collaborator

Stacked on #562 (PR 25). Closes two bug families surfaced by the CUDA bench against PR 25's tip on an NVIDIA A40 box.

What the bench surfaced

Bug Where Reproduces on
Top-down predict_streaming device mismatch cuda:0 vs cpu and mps:0 vs cpu CUDA + MPS (not CPU)
paf_workers > 0 hangs forever first worker call Linux + CUDA (fork start-method default)

The first one I'd assumed was CUDA-specific. Testing on Mac MPS reproduced it with the exact same shape of error — confirming the root cause is device-agnostic anti-pattern in the new flow, not a CUDA quirk.

Root cause 1 — CPU-allocated output buffers across the layers

Several InferenceLayer subclasses allocated output buffers with bare torch.full((...), float("nan")) / torch.ones(B) calls. No device= kwarg. On CPU this is silent (everything is CPU); on any non-CPU device the scatter from the layer's device-resident tensors into a CPU buffer raises RuntimeError: Expected all tensors to be on the same device.

Files touched (the same anti-pattern fixed everywhere):

  • sleap_nn/inference/layers/topdown.py — 3 torch.full allocations now device-aware (stage2_kpts_img.device).
  • sleap_nn/inference/layers/centroid.pypadded_peaks, padded_vals, centroid_vals, both eff_scale allocations now device-aware in both the GT branch and the postprocess. This was the actual upstream source of the topdown failure — the centroid layer was returning CPU-resident pred_centroids even when running on MPS/CUDA, which then poisoned the scatter in topdown stage 2.
  • sleap_nn/inference/layers/centered_instance.pyb_idx, matched_vals, pred_centroid_values, eff_scale device-aware.
  • sleap_nn/inference/layers/single_instance.pyeff_scale device-aware (uses x.device).
  • sleap_nn/inference/layers/{bottomup, bottomup_multiclass, topdown_multiclass}.pyeff_scale device-aware via scaled.device.

Root cause 2 — PafGroupingPool fork-on-Linux deadlocks against CUDA

sleap_nn/inference/streaming.py:328 constructed its ProcessPoolExecutor without an explicit mp_context. On Linux this defaults to fork, which inherits the parent's already-initialized CUDA context and deadlocks the first worker call. Pinned mp_context=multiprocessing.get_context(\"spawn\") — matches the existing default on macOS / Windows.

Why the existing test suite missed all of this

Every pre-existing inference test either (a) used _StubLayer instead of a real backend, (b) used NumpyProvider with synthetic frames, or (c) mocked the factory. None exercised the actual video → preprocess → backend.forward → postprocess → Outputs chain on a real fixture. The new tests/inference/test_e2e_video.py plugs that gap.

10 parametrized tests = 5 CPU + 5 MPS-gated, one per supported model type (single_instance, centroid_only, topdown, bottomup, multiclass_bottomup). Pre-fix the MPS topdown case raised the device-mismatch error; post-fix all 10 pass.

Out of scope for this PR

The CUDA bench also showed channel-mismatch failures for centroid-only + bottom-up predict_streaming on a real video (weight=[36, 72, 3, 3], expected 72 channels, got 36). These reproduce only on CUDA — clean on CPU and MPS with the same code + checkpoint + video. Probably cuDNN strictness or a torch 2.9.1 quirk with UNet skip-connection alignment on non-square inputs. Need a CUDA box to fix; filing as a separate follow-up issue.

The standalone centered_instance factory dispatch (raised Unsupported model_paths combination in the bench) is also intentionally out of scope — it's a missing feature symmetric to PR 25's centroid-only support, with a similar footprint. Will be filed as PR 27.

Test plan

  • pytest tests/inference/test_e2e_video.py — 10 passed (5 CPU + 5 MPS) on this Mac.
  • pytest tests/inference/test_paf_worker_pool.py — 8 passed (spawn fix intact, no regressions).
  • pytest tests/inference/ tests/cli/ tests/data/test_instance_centroids.py — 414 passed, 23 skipped (CUDA-gated).
  • black --check sleap_nn tests — clean.
  • ruff check sleap_nn/ — clean.
  • Re-run the CUDA bench on the A40 box (section D + topdown C should both now pass; centroid + bottomup channel mismatches will persist — that's the separate issue).

🤖 Generated with Claude Code

…or PAF pool

Bench-surfaced bug fixes from the CUDA bench (#560 prereq) + a new MPS/CUDA
regression test. Closes the "topdown predict_streaming fails with device
mismatch" and "paf_workers>0 deadlocks on Linux+CUDA" issues found while
benchmarking PR 25's tip on an NVIDIA A40 box.

## Root causes

Several `InferenceLayer` subclasses allocated output buffers with bare
`torch.full((...), float("nan"))` / `torch.ones(B)` calls — no `device=`
kwarg. On CPU this is silent (everything is CPU); on **any non-CPU device**
the scatter from the layer's device-resident tensors into a CPU buffer
raises `RuntimeError: Expected all tensors to be on the same device`.

Reproduces on MPS (Mac M-series) with the exact same shape of error as the
CUDA bench reported, so the bug is **non-CPU-device-path** in scope, not
CUDA-specific.

Separately, `PafGroupingPool` constructs its `ProcessPoolExecutor` without
an explicit `mp_context`. On Linux this defaults to **fork**, which inherits
the parent's already-initialized CUDA context and deadlocks the first
worker call. The fix pins `mp_context=multiprocessing.get_context("spawn")`,
matching the existing default on macOS / Windows.

## Files

* `sleap_nn/inference/layers/topdown.py` — 3 `torch.full` allocations now
  pass `device=stage2_kpts_img.device` (the working scatter source).
* `sleap_nn/inference/layers/centroid.py` — `padded_peaks`, `padded_vals`,
  `centroid_vals`, and both `PreprocInfo.eff_scale` allocations now
  device-aware in both the GT branch and the postprocess.
* `sleap_nn/inference/layers/centered_instance.py` — `b_idx`,
  `matched_vals`, `pred_centroid_values`, and `eff_scale` device-aware.
* `sleap_nn/inference/layers/single_instance.py` — `eff_scale` device-aware
  (uses `x.device` since this layer has no `scaled` variable in scope).
* `sleap_nn/inference/layers/bottomup.py`,
  `sleap_nn/inference/layers/bottomup_multiclass.py`,
  `sleap_nn/inference/layers/topdown_multiclass.py` — `eff_scale`
  device-aware via `scaled.device`.
* `sleap_nn/inference/streaming.py` — `PafGroupingPool.__enter__` pins the
  `spawn` start method explicitly. Docstring updated.
* `tests/inference/test_e2e_video.py` (new, 10 tests = 5 CPU + 5 MPS-gated):
  real fixture ckpt → `VideoProvider(small_robot.mp4)` →
  `predict_streaming()` for every supported model type. Pre-fix the MPS
  `topdown` case raised the device-mismatch error; post-fix all 10 pass.

## Why the existing test suite missed these

Every pre-existing inference test either (a) used `_StubLayer` instead of a
real backend, (b) used `NumpyProvider` with synthetic frames, or (c) mocked
the factory. None exercised the actual `video → preprocess → backend.forward
→ postprocess → Outputs` chain on a real fixture. The new
`tests/inference/test_e2e_video.py` plugs that gap.

## Out of scope

The CUDA bench also showed two **channel-mismatch** failures
(centroid-only + bottom-up `predict_streaming` on real video — both reporting
`weight=[36, 72, 3, 3], expected 72 channels, got 36`). These reproduce
**only on CUDA** (clean on CPU and MPS with the same code + checkpoint +
video). Probably cuDNN strictness or a torch 2.9.1 + non-square input
interaction with UNet skip connections. Need CUDA hardware to fix; will
file as a separate issue with the bench traceback attached.

## Tests

```
tests/inference/test_e2e_video.py           10 passed (5 CPU + 5 MPS)
tests/inference/test_paf_worker_pool.py      8 passed (spawn-context fix
                                                       intact, no regressions)
tests/inference/ + cli/ + test_instance_centroids
                                           414 passed, 23 skipped (CUDA-gated)
black --check sleap_nn tests                clean
ruff check sleap_nn/                        clean
```

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@codecov

codecov Bot commented May 14, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 68.41398% with 235 lines in your changes missing coverage. Please review.
✅ Project coverage is 64.46%. Comparing base (60575f1) to head (bff2525).

Files with missing lines Patch % Lines
sleap_nn/inference/predictor.py 48.37% 143 Missing ⚠️
sleap_nn/inference/run.py 14.28% 30 Missing ⚠️
sleap_nn/inference/loaders.py 88.26% 25 Missing ⚠️
sleap_nn/inference/layers/base.py 68.18% 21 Missing ⚠️
sleap_nn/cli.py 89.65% 6 Missing ⚠️
sleap_nn/inference/layers/centroid.py 70.58% 5 Missing ⚠️
sleap_nn/inference/layers/topdown.py 88.88% 5 Missing ⚠️
Additional details and impacted files
@@                                Coverage Diff                                @@
##           divya/inf-refactor-25-centroid-only-inference     #563      +/-   ##
=================================================================================
+ Coverage                                          64.20%   64.46%   +0.26%     
=================================================================================
  Files                                                124      125       +1     
  Lines                                              19066    19431     +365     
=================================================================================
+ Hits                                               12241    12527     +286     
- Misses                                              6825     6904      +79     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

…#564)

Stacked on #563 (PR 26). Closes the parity gaps surfaced by the
post-bench audit at
`scratch/2026-04-30-inference-refactor-implementation/parity_audit/parity_report.md`.

## Why this PR exists

The CUDA bench (#560) and subsequent audit revealed that the new
inference flow was **silently producing wrong outputs** on any video not
coincidentally matching the model's training dimensions. The PR-0 parity
goldens didn't catch it because they covered the wrong slice — pinning
model-forward parity (give the model the same preprocessed input, get
the same output) instead of pipeline parity (raw video → preprocess →
forward → postprocess → final keypoints).

The audit catalogued **10 numbered divergences** between legacy
`sleap_nn.inference.predictors.Predictor.from_model_paths` and new
`sleap_nn.inference.factory.from_model_paths`. This PR closes all of
them.

## What's in the PR (3 commits)

### 1. `PR 4 of #508 (deferred): shared full-preprocess helper across
InferenceLayers`

`InferenceLayer._apply_full_preprocess(x, max_stride,
unsqueeze_n_samples)` runs the legacy chain in order:
1. `ensure_rgb` / `ensure_grayscale` (channel coercion)
2. Per-sample `apply_sizematcher` to `(max_h, max_w)` → produces
`eff_scale` tensor
3. `resize_image` by `preprocess_config.scale` (input_scale)
4. `apply_pad_to_stride` to `max_stride`
5. `unsqueeze(dim=1)` for the n_samples Lightning-forward contract

Every raw-frame layer's `preprocess()` delegates to it:
`SingleInstance`, `Centroid`, `CenteredInstance`, `BottomUp`,
`BottomUpMultiClass`, `TopDownMultiClass`. Each step short-circuits when
its config field is the identity.

`SingleInstanceLayer.__init__` gains a `max_stride` arg (was missing).

### 2. `PR 8 + 11 of #508 (deferred): factory forwards preprocess
fields; .slp ingestion works; uint8 preserved`

Three fixes:

- **Factory wiring** — `factory.from_model_paths` now reads
`preprocess_config.{max_height, max_width, ensure_rgb,
ensure_grayscale}` off the legacy predictor (which resolves them from
`training_config.yaml`) and threads them into every layer's
`PreprocessConfig`. Centroid layers get the sizematcher fields;
centered-instance layers in topdown composition intentionally don't
(they receive per-instance crops, not raw frames — sizematcher there
would upsize the crops).
- **uint8 preservation** — split `_to_4d_float_tensor` into
`_to_4d_tensor` (layout only, dtype-preserving) + `_to_4d_float_tensor`
(thin float wrapper for backward compat). Every layer's `preprocess()`
uses `_to_4d_tensor` so uint8 stays uint8 through `tvf.resize`. The
eager `.float()` was producing `255.00006...` after resize, off-by-noise
from legacy's clean uint8 path. `normalize_on_gpu` inside the Lightning
forward handles uint8→float32 conversion.
- **`Predictor._batch_iter` instances kwarg** — only forwards
`batch.instances` to layers whose `predict` signature accepts the kwarg
(via `inspect.signature`). Pre-fix, `.slp` ingestion raised `TypeError`
on every layer except centroid/topdown.

### 3. `PR 27 of #508: topdown crops from sized image; permanent
parity-vs-legacy test`

- **TopDownLayer crops from the sized image** (post-sizematcher), not
the raw frame. Legacy `CentroidCrop` extracts `crop_hw` crops from the
sized image; the centered_instance model was trained on those
sized-space crops. The new flow was extracting crops from the raw frame,
producing crops covering a slightly different physical region (96×96 raw
pixels vs 96×96 sized pixels ≈ 140×140 raw pixels when
`eff_scale=0.686`). Median drift on topdown × small_robot.mp4 was ~15
px.

`TopDownLayer.predict` now re-applies the centroid layer's sizematcher
(via `_sizematch_like_centroid_layer`) to recover the sized image +
per-sample `eff_scale`, converts `centroids` back to sized space for
bbox construction, runs stage 2 in sized space, then divides the final
keypoints + bboxes by `eff_scale` to land in original-image space.
- **`tests/inference/test_parity_vs_legacy.py`** — permanent guardrail.
6 parametrized tests asserting final-keypoint parity between legacy and
new `Predictor` on every fixture × `{small_robot.mp4,
minimal_instance.pkg.slp}` within `atol/rtol=1e-4`.
-
**`tests/inference/layers/test_topdown.py::test_centroid_nms_dedupes_close_centroids`**
updated to stub the new `preprocess_config` + `_to_4d_tensor` attributes
on its `CentroidLayer.__new__(...)` mock.

## Final parity results

| fixture × source | model-input parity | final-keypoint parity |
|---|---|---|
| single_instance × small_robot.mp4 | ✓ identical | ✅ 0.0000 px (strict)
|
| single_instance × minimal_instance.pkg.slp | ✓ identical | ✅ 0.0000 px
(strict) |
| topdown × small_robot.mp4 | ✓ identical | ✅ 0.0001 px (strict) |
| topdown × minimal_instance.pkg.slp | ✓ both stages | ✅ 0.0000 px
(strict) |
| bottomup × small_robot.mp4 | ✓ identical | ✅ 0.0000 px (strict) |
| bottomup × minimal_instance.pkg.slp | ✓ identical | ✅ 0.0000 px
(strict) |

Pre-PR-27 the same audit showed:
- `single_instance × small_robot.mp4`: input shape `(4,3,320,560)` vs
legacy `(4,1,3,160,280)` (no input_scale, no n_samples wrap)
- `topdown / bottomup × small_robot.mp4`: input mean **53 vs 93**
(sizematcher missing entirely)
- `topdown × small_robot.mp4`: final keypoints **41.8 px max
nearest-neighbour drift** between flows
- `.slp` ingestion: `TypeError: InferenceLayer.predict() got unexpected
keyword argument 'instances'` on every non-centroid layer

## How this happened

The PR-0 goldens captured the model's input + output from the legacy
flow's `InferenceModel.forward`. The new layer tests then asserted that,
given the same model input, the layer produces the same model output.
That's model-forward parity; it doesn't exercise the preprocessing chain
(sizematcher / channel coercion / dtype / n_samples wrap).

Because the goldens were the only acceptance gate, every PR in the stack
passed "parity within 1e-5" while the preprocessing in the new flow was
silently incomplete. The first time real video frames entered through
`VideoProvider`, the divergence surfaced — visible in the CUDA bench as
a `RuntimeError` (channel mismatch on cuDNN) and on Mac CPU/MPS as
silently wrong predictions.

The new `tests/inference/test_parity_vs_legacy.py` is the gate that
should have existed since PR 0. It exercises the full
`from_model_paths(ckpt).predict(source)` pipeline and compares final
keypoints against legacy.

## Test plan

- [x] `pytest tests/inference/test_parity_vs_legacy.py` — 6 passed
(0.0000–0.0001 px max diff).
- [x] `pytest tests/inference/ tests/cli/
tests/data/test_instance_centroids.py` — 418 passed, 23 skipped
(CUDA-gated), 1 xfailed (PR-0 single-instance golden test, marked xfail
with note pointing here).
- [x] `pytest
tests/inference/layers/test_topdown.py::test_centroid_nms_dedupes_close_centroids`
— passes after stubbing.
- [x] `black --check sleap_nn tests` — clean.
- [x] `ruff check sleap_nn/` — clean.
- [ ] Re-run CUDA bench on the A40 box. Section C centroid + bottomup
channel-mismatch errors are expected to clear (they were sizematcher in
disguise).

## Out of scope

- The xfailed `test_single_instance_layer_parity_vs_pr0_golden` test was
written against the old Option-B contract (caller pre-preprocesses,
layer.preprocess is a no-op). PR 27 moves the layer to Option-A
(layer.predict(raw_frame) does the full pipeline) so feeding pre-scaled
input now double-scales. The new `test_parity_vs_legacy.py` supersedes
it as the parity guardrail.
- ONNX `Exported*Layer` adapters were not audited and likely have the
same anti-pattern (they bypass `_apply_full_preprocess`). Separate
follow-up after #560 reruns confirm the in-flow path is correct.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@gitttt-1234 gitttt-1234 marked this pull request as ready for review May 28, 2026 17:36
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@gitttt-1234 gitttt-1234 merged commit cca171d into divya/inf-refactor-25-centroid-only-inference May 28, 2026
6 checks passed
@gitttt-1234 gitttt-1234 deleted the divya/inf-refactor-26-bench-fixes branch May 28, 2026 17:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant