
Hard-fail config validation when bottom-up pipeline is selected with zero / disconnected skeleton edges #567

@talmo

sleap-nn issue draft — Validation for bottom-up + zero/disconnected edges

Target repo: talmolab/sleap-nn
Suggested labels: enhancement
Cross-ref: Migrated from talmolab/sleap#1247


Title

Hard-fail config validation when bottom-up pipeline is selected with zero / disconnected skeleton edges

Body

Problem

The bottomup and multi_class_bottomup pipelines compute Part-Affinity Fields over skeleton edges. When the user's skeleton has no usable edges, training fails with a cryptic error (or, worse, silently constructs a model with a zero-channel PAF head). There are three sub-cases that should all be caught up-front:

  1. Single-node skeleton (len(skeleton.nodes) == 1, therefore len(edges) == 0).
  2. Multi-node skeleton with zero edges declared (len(edges) == 0).
  3. Multi-node skeleton with disconnected components (the edge graph has > 1 connected component, so PAF-based grouping cannot assemble a full instance).

Originally reported in sleap#1247 for the legacy TF backend; the same structural gap exists in sleap-nn.

Where it crashes today

If a user manually selects bottomup with a zero-edge skeleton, the failure points are:

  • sleap_nn/data/custom_datasets.py:838, where self.edge_inds = labels[0].skeletons[0].edge_inds returns an empty tensor.
  • sleap_nn/data/edge_maps.py:242, where source_inds = edge_inds[:, 0].to(torch.int32) fails on a rank-1 empty tensor with a low-level shape error.
  • sleap_nn/architectures/heads.py:333, where PartAffinityFieldsHead.channels returns 0, producing a degenerate head.

The disconnected-graph case (sub-case 3) currently has no validation at all: training would run to completion and produce a low-quality model without ever surfacing the problem.

Existing precedent (steering, not validation)

The config recommender already detects sub-case 2 and steers users toward centroid:

# sleap_nn/config_generator/recommender.py:140-153
if stats.num_edges == 0:
    warnings.append("No edges in skeleton - bottom-up requires edges for PAFs")
    ...
    return PipelineRecommendation(
        recommended="centroid",
        reason="No skeleton edges available for bottom-up",
        ...
    )

But the recommender only fires from the TUI / config-generation flow. A user who hand-writes a training config, or whose frontend bypasses the recommender, still hits the crash.

Proposed change

Add a semantic check to verify_training_cfg() (sleap_nn/config/training_job_config.py:114-125), which currently validates only the schema and required fields:

def verify_training_cfg(cfg: DictConfig) -> DictConfig:
    ...
    check_must_be_set(config)
    check_pipeline_skeleton_compatibility(config)   # ← new
    return config

check_pipeline_skeleton_compatibility should:

  • If head_configs.bottomup (or multi_class_bottomup) is set, inspect the skeleton edges from data_config.skeletons (or the loaded labels).
  • Sub-case 1 (single node): Raise with a message like:

    Bottom-up training requires a multi-node skeleton with edges. Your skeleton has 1 node — use 'single_instance' (single animal per frame) or 'centroid' (top-down for multiple instances of a single landmark) instead.

  • Sub-case 2 (zero edges, multi-node): Raise with a message recommending adding edges to the skeleton OR switching to centroid (top-down).
  • Sub-case 3 (disconnected components): Raise with a message identifying the disconnected nodes and recommending either adding bridge edges or switching to top-down.

Implementation note for sub-case 3: a simple union-find over skeleton.edge_inds is sufficient; no external graph library needed.
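A minimal sketch of how the three checks and the union-find pass could fit together. Everything here is illustrative rather than existing sleap-nn code: the signature (plain pipeline name, node names, and edge index pairs instead of the DictConfig), the connected_components helper, the BOTTOM_UP_PIPELINES constant, and the exact message wording are all assumptions. In particular, how edges are read out of data_config.skeletons is left open.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical constant: pipelines that build PAFs over skeleton edges.
BOTTOM_UP_PIPELINES = ("bottomup", "multi_class_bottomup")


def connected_components(
    num_nodes: int, edge_inds: Sequence[Tuple[int, int]]
) -> List[List[int]]:
    """Group node indices into connected components using union-find."""
    parent = list(range(num_nodes))

    def find(i: int) -> int:
        # Path halving: re-point each visited node at its grandparent.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for src, dst in edge_inds:
        root_src, root_dst = find(src), find(dst)
        if root_src != root_dst:
            parent[root_dst] = root_src

    groups: Dict[int, List[int]] = {}
    for node in range(num_nodes):
        groups.setdefault(find(node), []).append(node)
    return sorted(groups.values())


def check_pipeline_skeleton_compatibility(
    pipeline: str,
    node_names: Sequence[str],
    edge_inds: Sequence[Tuple[int, int]],
) -> None:
    """Raise ValueError if a bottom-up pipeline has no usable edge graph."""
    if pipeline not in BOTTOM_UP_PIPELINES:
        return

    # Sub-case 1: a single node can never have edges.
    if len(node_names) == 1:
        raise ValueError(
            "Bottom-up training requires a multi-node skeleton with edges. "
            "Your skeleton has 1 node; use 'single_instance' or 'centroid' instead."
        )

    # Sub-case 2: several nodes, but no edges declared.
    if len(edge_inds) == 0:
        raise ValueError(
            "Bottom-up training requires skeleton edges, but none are defined. "
            "Add edges to the skeleton or switch to 'centroid' (top-down)."
        )

    # Sub-case 3: the edge graph splits into more than one component.
    components = connected_components(len(node_names), edge_inds)
    if len(components) > 1:
        largest = max(components, key=len)
        stranded = [
            node_names[i] for comp in components if comp is not largest for i in comp
        ]
        raise ValueError(
            f"Skeleton edges form {len(components)} disconnected components; "
            f"nodes not connected to the main component: {stranded}. "
            "Add bridge edges or switch to 'centroid' (top-down)."
        )
```

For example, calling the sketch with "bottomup", node names ["head", "thorax", "wing"], and edges [(0, 1)] would raise on sub-case 3 and name "wing" as the stranded node, while the same call with edges [(0, 1), (1, 2)] returns without error.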

Where the check should live

Putting it in verify_training_cfg ensures it fires before ModelTrainer.get_model_trainer_from_config() (sleap_nn/training/model_trainer.py:118) instantiates the data pipeline (sleap_nn/data/custom_datasets.py:2135+ for the bottom-up dispatch), so the user gets a clean error instead of a stack trace.

Acceptance criteria

  • verify_training_cfg raises a clear, actionable ValueError (or dedicated config-validation exception) for each of the three sub-cases.
  • Error messages name a specific recommended alternative pipeline.
  • Unit tests cover all three sub-cases plus the happy path (multi-node, connected skeleton).
  • Recommender (config_generator/recommender.py) is updated to also catch sub-case 3 (disconnected components), for parity.

Related

  • talmolab/sleap#1247 — original 2023 enhancement request (legacy TF backend; being closed in favor of this issue).
  • A companion issue on the frontend/orchestration side will track surfacing this in the model-config UX so users see the warning before submitting an invalid config.
