Skip to content

Refactor: extract a shared builder/spec matcher used by ObjectMapper and the validator #1456

@rly

Description

@rly

Problem

HDMF has two independent implementations of the same problem: pair builders against the candidate subspecs in a parent's spec.

  • Build/read matcher: ObjectMapper.__get_subspec_values and __get_sub_builders in src/hdmf/build/objectmapper.py. Used during mapper.build (write) and mapper.construct (read) to pair sub-builders to subspecs and recurse.
  • Validator matcher: SpecMatcher class in src/hdmf/validate/validator.py:741. Used during validation to pair builders against expected specs.

The two diverged organically. They handle the same fundamental problem with different algorithms and different bug profiles.

How they differ today

Concern Build/read matcher Validator matcher
Direction iterate subspecs, find matching builders iterate builders, find best matching spec
Name matching yes yes
Data-type matching yes (with namespace hierarchy walk) yes (no hierarchy walk)
Quantity-aware tie-breaking no yes (prefers unsatisfied required slots)
Unmatched-builder tracking no (silently dropped) yes (warned about)
Coupling inline manager.construct(...) calls; sets builder.matched_spec pure pairing, validator acts in second pass

The validator already notes its algorithm has issues: # TODO future improvements to this matching algorithm should resolve these discrepancies (validator.py:794-797). The build/read matcher has the same kind of issues but no such note.

Proposed shared module

Lift the pairing logic into a pure helper, e.g. hdmf.build.matcher or similar, with a signature roughly like:

def match_subspecs(parent_spec, sub_builders, namespace_catalog):
    \"\"\"Pair sub-builders against subspecs of parent_spec.

    Returns:
        matches: dict[Spec, list[Builder]] - subspecmatched sub-builders.
        unmatched: list[Builder] - sub-builders that didn't match any subspec.
    \"\"\"

Properties of the shared helper:

  • Pure: no callbacks into manager.construct, no side effects on builders. Returns a mapping; consumers act on it.
  • Quantity-aware: when multiple candidates fit, prefer unsatisfied required slots first (the validator's algorithm).
  • Tracks unmatched: returns a list of sub-builders that didn't match any subspec, so consumers can warn or error.
  • Hierarchy-aware: walks the namespace inheritance hierarchy when matching by data_type (the build/read matcher does this; the validator doesn't).

Consumers:

  • ObjectMapper.__get_subspec_values calls the helper, then iterates matches to call manager.construct(builder) (read) or already-pre-created builders (write). After Phase 4 of Refactor: matched specs on builders (replace spec_ext plumbing) #1448, the matcher also sets builder.matched_spec = subspec — that side-effect can stay in the consumer or move into the shared helper as an option.
  • SpecMatcher in the validator becomes a thin wrapper around the helper, plus the existing per-builder validation.

Improvements that would propagate

After consolidation:

  1. Unmatched-builder warnings on read/write. Currently if a parent has an unexpected child (typo in a column name, stale data from an older spec version), the build/read matcher silently drops it. The validator already flags these. Sharing the matcher gives the same warning to read/write paths, at minimum as a debug log.
  2. Quantity-aware tie-breaking everywhere. When a parent has multiple candidate subspecs of the same type with different quantities (e.g., one required and one optional), and there's exactly one builder of that type, today's build/read matcher arbitrarily assigns to the first matching subspec. The validator's algorithm prefers unsatisfied required slots. Real correctness improvement, propagates to read/write.
  3. Hierarchy-aware matching in the validator. The validator's matcher doesn't walk the namespace inheritance hierarchy, so a builder of type MySpecificThing may not match a subspec calling for the parent type MyGenericThing. The build/read matcher's __get_sub_builders does walk namespace_catalog.get_hierarchy(...). Sharing fixes the validator's gap.

What's not in the shared matcher

  • Per-builder content validation (dtype, shape, required attributes, refs). Those are post-match concerns; the matcher's job is just pairing.
  • Construction (manager.construct). Stays in ObjectMapper.
  • The builder.matched_spec side-effect from Refactor: matched specs on builders (replace spec_ext plumbing) #1448's Phase 2/4 work could live in the matcher (as an opt-in argument) or stay in ObjectMapper. Probably cleanest in the consumer.

Migration

  • Phase A: introduce the shared helper; have both ObjectMapper.__get_sub_builders and SpecMatcher delegate to it. Behavior should be identical to whichever consumer-side algorithm we standardize on (probably the validator's, with the build/read matcher's hierarchy walk added).
  • Phase B: remove duplicate code in SpecMatcher and __get_sub_builders.
  • Phase C: turn on the new behavior (quantity-aware tie-breaking, unmatched-builder logging) in build/read paths. May require updating tests that depend on the old arbitrary-tie behavior.

Why now (context)

This came up while reviewing the matched-specs refactor in #1448. The build/read matcher is the natural place to attach builder.matched_spec, and looking at its responsibilities made the duplication with the validator's matcher visible. Worth doing during validator integration, not before — but worth tracking now so we don't quietly let the two diverge further.

References

Metadata

Metadata

Assignees

No one assigned

    Labels

    priority: lowalternative solution already working and/or relevant to only specific user(s)topic: maintenanceIssues related to tech debt / code maintainability

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions