sleap-nn issue draft — Validation for bottom-up + zero/disconnected edges
Target repo: talmolab/sleap-nn
Suggested labels: enhancement
Cross-ref: Migrated from talmolab/sleap#1247
Title
Hard-fail config validation when bottom-up pipeline is selected with zero / disconnected skeleton edges
Body
Problem
The bottomup and multi_class_bottomup pipelines compute Part-Affinity Fields over skeleton edges. When the user's skeleton has no usable edges, training fails with a cryptic error (or, worse, silently constructs a model with a zero-channel PAF head). There are three sub-cases that should all be caught up-front:
- Single-node skeleton (len(skeleton.nodes) == 1, therefore len(edges) == 0).
- Multi-node skeleton with zero edges declared (len(edges) == 0).
- Multi-node skeleton with disconnected components (the edge graph has > 1 connected component, so PAF-based grouping cannot assemble a full instance).
Originally reported in sleap#1247 for the legacy TF backend; the same structural gap exists in sleap-nn.
Where it crashes today
If a user manually selects bottomup with a zero-edge skeleton, the failure points are:
- sleap_nn/data/custom_datasets.py:838: self.edge_inds = labels[0].skeletons[0].edge_inds returns an empty tensor.
- sleap_nn/data/edge_maps.py:242: source_inds = edge_inds[:, 0].to(torch.int32) fails on a rank-1 empty tensor with a low-level shape error.
- sleap_nn/architectures/heads.py:333: PartAffinityFieldsHead.channels returns 0, producing a degenerate head.
The disconnected-graph case (sub-case 3) currently has no validation at all and would train to a low-quality model without surfacing the issue.
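To make the "zero-channel PAF head" concrete, here is a minimal sketch. It assumes the usual PAF layout of one 2-D vector field (x and y components) per edge. `paf_channels` is a hypothetical helper written for illustration and is not part of sleap-nn; the actual `PartAffinityFieldsHead.channels` implementation may differ in detail.

```python
from typing import Sequence, Tuple


def paf_channels(edge_inds: Sequence[Tuple[int, int]]) -> int:
    """Hypothetical channel count for a PAF head.

    Assumption: each skeleton edge contributes two output channels,
    the x and y components of its affinity vector field.
    """
    return 2 * len(edge_inds)


# A three-node chain has two edges, so the head would have four channels.
chain_edges = [(0, 1), (1, 2)]
chain_channels = paf_channels(chain_edges)

# With no edges the head has no output channels at all, which is the
# degenerate case described above.
empty_channels = paf_channels([])
```

Under this assumption the zero-edge skeleton yields a head with nothing to predict, so the model can be constructed but cannot learn any grouping signal.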
Existing precedent (steering, not validation)
The config recommender already detects sub-case 2 and steers users toward centroid:
# sleap_nn/config_generator/recommender.py:140-153
if stats.num_edges == 0:
    warnings.append("No edges in skeleton - bottom-up requires edges for PAFs")
    ...
    return PipelineRecommendation(
        recommended="centroid",
        reason="No skeleton edges available for bottom-up",
        ...
    )
But the recommender only fires from the TUI / config-generation flow. A user who hand-writes a training config, or whose frontend bypasses the recommender, still hits the crash.
Proposed change
Add a semantic check in verify_training_cfg() (currently at sleap_nn/config/training_job_config.py:114-125 — only validates schema/required fields):
def verify_training_cfg(cfg: DictConfig) -> DictConfig:
    ...
    check_must_be_set(config)
    check_pipeline_skeleton_compatibility(config)  # ← new
    return config
check_pipeline_skeleton_compatibility should:
- If head_configs.bottomup (or multi_class_bottomup) is set, inspect the skeleton edges from data_config.skeletons (or the loaded labels).
- Sub-case 1 (single node): Raise with a message like:
Bottom-up training requires a multi-node skeleton with edges. Your skeleton has 1 node — use 'single_instance' (single animal per frame) or 'centroid' (top-down for multiple instances of a single landmark) instead.
- Sub-case 2 (zero edges, multi-node): Raise with a message recommending either adding edges to the skeleton or switching to centroid (top-down).
- Sub-case 3 (disconnected components): Raise with a message identifying the disconnected nodes and recommending either adding bridge edges or switching to top-down.
Implementation note for sub-case 3: a simple union-find over skeleton.edge_inds is sufficient; no external graph library needed.
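The three checks and the union-find note could be sketched roughly as follows. This is a standalone illustration under stated assumptions, not the proposed implementation: it takes a head name, a list of node names, and a list of (source, destination) index pairs directly, whereas the real check would read them from the config or the loaded labels. The names `connected_components`, `check_skeleton_for_bottom_up`, and `BOTTOM_UP_HEADS` are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical constant: head names that rely on PAF-based grouping.
BOTTOM_UP_HEADS = ("bottomup", "multi_class_bottomup")


def connected_components(
    num_nodes: int, edge_inds: Sequence[Tuple[int, int]]
) -> List[List[int]]:
    """Group node indices into connected components using union-find."""
    parent = list(range(num_nodes))

    def find(i: int) -> int:
        # Path halving: point each visited node at its grandparent.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for src, dst in edge_inds:
        root_src, root_dst = find(src), find(dst)
        if root_src != root_dst:
            parent[root_dst] = root_src

    groups: Dict[int, List[int]] = {}
    for node in range(num_nodes):
        groups.setdefault(find(node), []).append(node)
    return sorted(groups.values())


def check_skeleton_for_bottom_up(
    head_name: str,
    node_names: Sequence[str],
    edge_inds: Sequence[Tuple[int, int]],
) -> None:
    """Raise ValueError if a bottom-up head is paired with an unusable skeleton."""
    if head_name not in BOTTOM_UP_HEADS:
        return  # Other pipelines do not depend on skeleton edges here.

    # Sub-case 1: a single node can never have edges.
    if len(node_names) == 1:
        raise ValueError(
            "Bottom-up training requires a multi-node skeleton with edges. "
            "Your skeleton has 1 node; use 'single_instance' or 'centroid' instead."
        )

    # Sub-case 2: several nodes but no edges declared.
    if len(edge_inds) == 0:
        raise ValueError(
            "Bottom-up training requires skeleton edges, but none are declared. "
            "Add edges to the skeleton or switch to 'centroid' (top-down)."
        )

    # Sub-case 3: edges exist but do not connect every node.
    components = connected_components(len(node_names), edge_inds)
    if len(components) > 1:
        described = "; ".join(
            ", ".join(node_names[i] for i in comp) for comp in components
        )
        raise ValueError(
            f"Skeleton has {len(components)} disconnected components ({described}). "
            "Add bridge edges or switch to a top-down pipeline."
        )
```

For example, `check_skeleton_for_bottom_up("bottomup", ["head", "thorax", "tail"], [(0, 1)])` raises because "tail" is not connected to the other two nodes, while the same call with edges `[(0, 1), (1, 2)]` returns without error.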
Where the check should live
Putting it in verify_training_cfg ensures it fires before ModelTrainer.get_model_trainer_from_config() (sleap_nn/training/model_trainer.py:118) instantiates the data pipeline (sleap_nn/data/custom_datasets.py:2135+ for the bottom-up dispatch), so the user gets a clean error instead of a stack trace.
Acceptance criteria
- verify_training_cfg raises a clear, actionable ValueError (or dedicated config-validation exception) for each of the three sub-cases.
- The config recommender (config_generator/recommender.py) is updated to also catch sub-case 3 (disconnected components), for parity.
Related
- talmolab/sleap#1247: original 2023 enhancement request (legacy TF backend; being closed in favor of this issue).
- A companion issue on the frontend/orchestration side will track surfacing this in the model-config UX so users see the warning before submitting an invalid config.