Skip to content

Review: Data & Domain (legal model, schema, database, test coverage) #477

@ehotting

Description

@ehotting

Data & Domain Review

Comprehensive review from 4 expert perspectives: Legal Domain Model, YAML Schema Design, Database & Query Patterns, and Test Coverage & BDD Gaps.


🔴 Critical / High Severity

Legal model: Temporal gaps

  • No valid_until / expiry: Laws never expire — a repealed law from 2026 is still returned for 2027 queries. Add valid_until field and filter in select_version_for_date.
  • No retroactive effect (terugwerkende kracht): Can't model laws that apply to facts before their valid_from. Add optional retroactive_from field.
  • No transitional provisions (overgangsbepalingen): Can't express "apply old rates to claims submitted before date X" — requires per-law version pins or article-level transition conditions.
  • Impact: Engine is not safe for historical recalculation scenarios, which are routine in Dutch administrative practice.

Legal model: Missing operations

Schema: 4 engine operations missing from JSON schema

  • NOT_EQUALS, IS_NULL, NOT_NULL, NOT_IN — accepted by engine but fail schema validation (schema/v0.5.1/schema.json)
  • YAML files using these operations pass the engine but fail just validate

Schema: Version drift — corpus pinned to v0.5.0, engine ships v0.5.1

  • All corpus files except test_untranslatables reference v0.5.0; schema/latest symlink is stale
  • untranslatables field (v0.5.1-only) can't be validated against declared schema

Database: law_entries.status can desync from actual job states

  • Pre-job status updates (worker.rs:155-161) are fire-and-forget with if let Err(e) — errors only logged
  • Post-success updates (reset_fail_count, upsert_law) are outside the completion transaction
  • Fix: Combine status update into claim_job CTE; put all post-success updates in one transaction

Test coverage: Major gaps in critical paths

  • IsNull/NotNull operations: Zero unit tests — null coercion bugs are silent eligibility errors
  • NotEquals/NotIn operations: Zero unit tests
  • Hook/override resolution: BDD-only, no unit-level tests for RFC-007 features
  • ExecutionReceipt: Zero tests — audit artifact regressions invisible
  • Worker process (worker.rs, 974 lines): Zero tests for job processing logic
  • Error golden tests: Only 2 of 15+ error variants covered
  • Cross-law golden tests: Only 2 cases (simple + parameter flow-through), no multi-hop chains
  • Enrich pipeline: LLM interaction logic entirely untested at integration level

🟡 Medium Severity

Legal model gaps

  • Scope limited to gemeente_code: Provinces and water boards lack first-class support despite RegulatoryLayer variants existing. matches_scope only checks gemeente_code (resolver.rs:292).
  • Hook ordering non-deterministic: Same-priority hooks fire in HashMap insertion order. Sort by (layer_rank, valid_from) for auditability.
  • Override validation: No load-time check that override target (law_id, article, output) exists. Cross-law load order not enforced.
  • #datum_inwerkingtreding unresolved: valid_from with # reference causes priority comparison to fail (priority.rs:9-15 TODO).
  • No model for discretionary powers: untranslatable_type should distinguish "engine limitation" from "inherently discretionary".
  • Co-delegation: Multiple implementing regulations can't combine additively — only winner-takes-all.

Schema design issues

  • Top-level missing additionalProperties: false: Typos in field names pass silently. $id, $schema, uuid undeclared.
  • definitions map allows any structure: No constraint on value key pattern.
  • source: {} semantically ambiguous: Three modes (registry, same-law, cross-law) indistinguishable by schema. Use discriminated oneOf.
  • action allows empty body: Only output required, no value/operation — produces nothing at runtime.
  • machineReadableSection.requires shape mismatch: Schema defines objects, Rust expects Vec<String>.
  • Schema $id uses mutable main branch URL: Inconsistent with tag-based corpus convention (RFC-013).
  • unit enum missing percentage: Engine accepts it, schema rejects it.

Database issues

  • No index on law_entries.status: Admin API filters/counts by status heavily — full sequential scan.
  • No index on jobs.started_at WHERE status='processing': Orphan reaper scans full table every poll cycle.
  • No index on jobs.completed_at WHERE status='failed': Metrics queries degrade as jobs table grows.
  • Single enrich_job_id column: Can't track multiple providers. Replace with junction table.
  • Harvest job TOCTOU race: create_harvest_job_if_not_exists has documented race condition — no unique constraint (unlike enrich jobs which use ON CONFLICT DO NOTHING).
  • Reaper runs every poll cycle without throttling: Hundreds of times/second during job bursts. Add last_reap: Instant throttle.
  • List endpoint fetches full JSONB: payload, result, progress returned for all rows — read amplification.

Test gaps (medium)

  • BDD: Participatiewet Art 11 (nationality/means test) — primary eligibility gates untested
  • Priority: lex specialis — not unit-tested (only BDD)
  • filter_parameters_for_article and resolve_inputs_with_service — complex hot-path helpers with no direct tests
  • Harvester YAML writer — zero tests for serialization to corpus format
  • Corpus validation — only 5 test cases, missing: unknown regulatory layer, malformed $id, invalid dates
  • No property-based tests: Evaluation determinism, arithmetic commutativity untested

🟢 Low Severity

Legal model

  • Single produces per article — can't model conditional legal character (RFC-001 open question)
  • Subdelegation chain validation: AMvB can't grant wider delegation than wet gave it — not enforced
  • Self-reference via implements index not detected at load time

Schema

  • legalBasis has no required fields — empty annotations schema-valid
  • open_terms[].type enum narrower than baseField.typeobject/array excluded
  • valid_from: #reference syntax undocumented with no resolution semantics
  • procedure.stages have no transition model or name constraints
  • Three regulatory layers (KONINKLIJK_BESLUIT, BELEIDSREGEL, UITVOERINGSBELEID) have no ID constraints

Database

  • Count + data as two non-transactional queries (inconsistent pagination snapshots)
  • claim_job correlated subquery vs CTE pattern (minor planner concern)
  • Unbounded list_jobs/list_laws internal helpers (no LIMIT)
  • hashtext() advisory lock uses 32-bit hash — collision risk (theoretical)
  • update_progress redundant updated_at SET + no status = 'processing' guard
  • Migration 0007 uses ALTER TYPE requiring ACCESS EXCLUSIVE lock
  • Migrations 0004/0005 are data-mutating DML with hardcoded paths

✅ Positive Findings

Legal model

  • IoC delegation model (RFC-003), hook mechanism (RFC-007), lex specialis override, and priority resolution are well-reasoned and align with Dutch legal structure
  • Untranslatables framework (RFC-012) correctly identifies encoding gaps
  • Cross-law cycle detection via ResolutionContext.visited is correct for stateless execution

Testing

  • Engine operations have strong unit coverage (~83 tests in operations.rs)
  • BDD suite covers 53 scenarios across 8 feature files including multi-law chains
  • Golden fixture tests provide regression protection for core execution paths
  • Pipeline integration tests properly use testcontainers for DB-backed testing

Database

  • Job queue uses FOR UPDATE SKIP LOCKED correctly for concurrent claiming
  • Advisory locks protect against concurrent harvests of the same law
  • Enrich jobs use proper partial unique index with ON CONFLICT DO NOTHING

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions