Skip to content

feat(pipeline): harvest related legislation discovered during enrichment#901

Merged
tdjager merged 5 commits into
mainfrom
feat/enrich-related-legislation
Jul 2, 2026
Merged

feat(pipeline): harvest related legislation discovered during enrichment#901
tdjager merged 5 commits into
mainfrom
feat/enrich-related-legislation

Conversation

@tdjager

@tdjager tdjager commented Jul 2, 2026

Copy link
Copy Markdown
Collaborator

Why

Regelingen reached only through the machine-readable model — a source.regulation binding, a legal_basis anchor, or an open delegation (open_terms/implements, e.g. regeling_standaardpremie filling wet_op_de_zorgtoeslag's standaardpremie) — are never auto-harvested. The recursive harvester follows only <extref> BWB hyperlinks in the source text, and those model links don't exist in the text — they are a product of enrichment. So the discovery must happen where the model is created: the enrichworker (which has corpus + DB). (A prior admin-side closure endpoint, #894, was closed — the deployed admin has no corpus.)

What

The enrichment agent now returns the related legislation it needs in a result envelope — a .enrichment-result.yaml sidecar the agent writes, kept deliberately out of the law YAML so the law artifact stays schema-conformant (extensible for future enrichment return-values). After a successful enrich, the worker:

  1. reads the envelope's related_legislation,
  2. resolves each entry to a BWB id — hybrid: agent bwb_idlaw_entries slug lookup → single-hit SRU title search (>1needs_confirmation, skip),
  3. enqueues a follow-up harvest per resolved id via the canonical dedup path.

Newly harvested laws auto-enrich (existing ENRICH_AUTO_ENQUEUE) and return their related legislation → the dependency graph fills itself by recursion.

Depth & priority

EnrichPayload.depth carries recursion depth harvest → enrich → harvest; each related harvest is depth+1 at priority = 40 − (depth+1), so deeper (more speculative) nesting yields to shallower/interactive harvests (Priority::new clamps 0..=100; queue is ORDER BY priority DESC).

Safety / guards

  • Opt-in: HARVEST_RELATED_LEGISLATION (off by default) + RELATED_HARVEST_MAX_DEPTH (default 2).
  • Reuses ENRICH_DAILY_LIMIT (spend) and create_harvest_job_if_not_exists (dedup); harvest_exhausted skipped.
  • Best-effort: the whole block runs after the enrichment is committed and swallows every error into logs — it can never fail the enrichment or the worker loop.

Also

  • search_bwb_by_name(client, q) extracted from the axum handler (reused by both); find_bwb_id_by_slug made pub; slug hits re-validated as BWB (CVDR skipped, so no malformed harvest).
  • No law-schema change. .claude/skills/law-generate/SKILL.md instructs writing the envelope.

Known limitations

  • Shared depth counter with the extref harvester → the loop fires for roots/shallow laws; a related_depth field is the clean follow-up.
  • Unambiguous-but-loose SRU match accepted as-is (opt-in, harvest-only, lands on a reviewable enrich/{provider} branch).

Testing

  • Unit (52 pipeline tests green): envelope (de)serialization (full/missing/absent/malformed → empty), is_valid_bwb_id/slugify, related_harvest_priority (clamps, −1/level), depth round-trips.
  • End-to-end (to run on the preview): set HARVEST_RELATED_LEGISLATION=true, (re-)enrich wet_op_de_zorgtoeslag → confirm it writes .enrichment-result.yaml, a follow-up harvest for the delegated regeling is enqueued (depth 1, priority 39), harvested, then auto-enriched. Verify ENRICH_DAILY_LIMIT still bounds spend.

tdjager added 3 commits July 2, 2026 18:58
Regelingen reached only through the machine-readable model (source.regulation,
legal_basis, and open_terms/implements delegations like regeling_standaardpremie)
are never auto-harvested: the recursive harvester follows only <extref> BWB
hyperlinks in the source text, and those model links are a product of enrichment.

The enrichment agent now returns the related legislation it needs in a result
envelope (.enrichment-result.yaml) — deliberately kept OUT of the law YAML so the
law artifact stays schema-conformant. After a successful enrich the worker reads
the envelope, resolves each entry to a BWB id (agent bwb_id -> law_entries slug ->
single-hit SRU title search), and enqueues a follow-up harvest for each. Newly
harvested laws auto-enrich and return their own related legislation, so the
dependency graph fills itself by recursion.

- EnrichPayload.depth carries recursion depth (harvest -> enrich -> harvest);
  each related harvest is depth+1 at priority 40-(depth+1), so deeper nesting
  yields to shallower/interactive harvests.
- Opt-in: HARVEST_RELATED_LEGISLATION (off by default) + RELATED_HARVEST_MAX_DEPTH
  (default 2); reuses ENRICH_DAILY_LIMIT for spend and create_harvest_job_if_not_exists
  for dedup. Best-effort: nothing here can fail the already-committed enrichment.
- search_bwb_by_name extracted from the axum handler for reuse; find_bwb_id_by_slug
  made pub; slug hits re-validated as BWB (CVDR skipped). RFC-025 documents the
  pattern and its known limitations.
Drop the HARVEST_RELATED_LEGISLATION opt-in flag. The follow-up harvest only
enqueues harvest jobs (no LLM cost); the expensive re-enrichment of those laws
stays gated by ENRICH_AUTO_ENQUEUE + ENRICH_DAILY_LIMIT, and the recursion is
bounded by RELATED_HARVEST_MAX_DEPTH. So there is nothing to protect behind a
separate flag.
The enrichment result envelope (.enrichment-result.yaml) and the existing
.enrichment.yaml metadata sidecar are written into a law directory but are not
law files. `find -name '*.yaml'` matches leading-dot names and the pre-commit
`files:` regex matches them too, so validating one fails (missing $id). Skip
dot-prefixed sidecars in both script/validate.sh and the validate-law-yaml hook.
@tdjager tdjager marked this pull request as ready for review July 2, 2026 20:01

@claude claude Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR adds a feedback loop that lets the enrichment agent declare the related legislation it just modeled, then enqueues follow-up harvests so the dependency graph fills itself in recursively. The core design (sidecar envelope, hybrid BWB resolution, depth/priority capping, best-effort error handling) is sound, and the unit tests cover the main serialization and resolution paths well.

Two issues found in the new resolution logic:


🟠 Significant — CVDR slug hit falls through to SRU name search

packages/pipeline/src/worker.rs, in resolve_related_bwb_id

When step (b) finds the slug in law_entries but the stored id is a CVDR id (not a BWB id), the code logs "skipping" but does not return — execution falls through to step (c), the SRU name search on wetten.overheid.nl.

Ok(Some(id)) if is_valid_bwb_id(&id) => return RelatedResolution::Resolved(id),
Ok(Some(id)) => {
    tracing::debug!(slug = %slug, resolved = %id, "slug resolved to a non-BWB id; skipping");
    // no return here — falls through to SRU search
}

law_entries does store CVDR ids (confirmed — the harvest worker calls upsert_law with the CVDR job.law_id). A law named e.g. "Verordening X" can have a slug hit mapping to a CVDR id, then a single SRU hit for a different national law with a similar name, producing Resolved(wrong_bwb_id) where Unresolved was intended. The comment itself says "skip non-BWB hits" — the code should do exactly that:

Ok(Some(id)) => {
    tracing::debug!(slug = %slug, resolved = %id, "slug resolved to a non-BWB id; skipping");
    return RelatedResolution::Unresolved;  // add this
}

🟡 Minor — resolved counter inflated by exhausted-skipped laws

packages/pipeline/src/worker.rs, in harvest_related_legislation

resolved += 1 fires before the HarvestExhausted guard. When a law is exhausted, the loop continues without incrementing enqueued, so the summary log silently folds both "exhausted-skipped" and "already-queued (dedup)" into the resolved − enqueued gap with no way to distinguish them.

resolved += 1;              // ← incremented here
if let Ok(law) = law_status::get_law(pool, &bwb_id).await {
    if law.status == LawStatusValue::HarvestExhausted {
        continue;           // enqueued not incremented — gap is ambiguous
    }
}

Moving resolved += 1 to after the exhausted check, or adding a dedicated exhausted counter to the summary log, would make the log actionable.


🟡 Minor — blocking I/O in async context

packages/pipeline/src/enrich.rs:438

read_enrichment_result_envelope uses std::fs::read_to_string (synchronous) and is called directly from the async execute_enrich_with_runner. The surrounding code already uses tokio::fs::write. In practice the file is tiny and arrives after a minutes-long LLM call, so the impact is negligible on normal disk — but under a slow filesystem (NFS, container overlay under memory pressure) this blocks the Tokio worker thread indefinitely. Wrapping in tokio::task::spawn_blocking or switching to tokio::fs::read_to_string would match the pattern used elsewhere in the file.

Address CI review findings:
- A CVDR slug hit no longer falls through to the SRU name search (the slug
  already identified the law; a title match could resolve a *different* national
  law). It now returns Unresolved.
- The harvest summary log separates already_queued and exhausted skips instead of
  conflating them in the resolved-but-not-enqueued gap.

Also drop RFC-025: the related-legislation harvest loop is an implementation
detail, not a cross-cutting design decision that warrants an RFC.
@github-actions github-actions Bot deleted a comment from claude Bot Jul 2, 2026
@github-actions github-actions Bot deleted a comment from claude Bot Jul 2, 2026
@tdjager

tdjager commented Jul 2, 2026

Copy link
Copy Markdown
Collaborator Author

Both review findings addressed in 4a24029: the CVDR slug hit now returns Unresolved instead of falling through to the SRU name search, and the harvest summary log separates already_queued/exhausted skips. Also dropped RFC-025 (this is an implementation detail, not an RFC).

@claude claude Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pipeline: harvest related legislation discovered during enrichment

Correctness

🟡 Minor — SRU result not validated as BWB ID (inconsistency with paths a and b)
packages/pipeline/src/worker.rs — the resolve_related_bwb_id function resolves via three paths:

  • (a) explicit bwb_id: guarded by is_valid_bwb_id
  • (b) slug lookup: guarded by is_valid_bwb_id
  • (c) SRU single hit: results[0].bwb_id.clone() returned directly, no validation

BwbSearchResult.bwb_id comes from parse_sru_response which parses wetten.overheid.nl SRU output — in practice that's always a BWBR ID — but the inconsistency means a malformed response would propagate an invalid ID into HarvestPayload.bwb_id. Since the harvest is best-effort this fails silently, but it's an easy one-liner fix:

Ok(results) if results.len() == 1 && is_valid_bwb_id(&results[0].bwb_id) => {
    RelatedResolution::Resolved(results[0].bwb_id.clone())
}

🟡 Minor — blocking I/O in async context
packages/pipeline/src/enrich.rs:438read_enrichment_result_envelope calls std::fs::read_to_string, blocking the Tokio runtime thread. Every other file operation in execute_enrich_with_runner uses tokio::fs. For a tiny YAML sidecar the impact is negligible, but it's inconsistent:

// current (sync, blocks runtime thread)
let content = match std::fs::read_to_string(&envelope_path) {}

// idiomatic in this async context
let content = match tokio::fs::read_to_string(&envelope_path).await {}

Requires making read_enrichment_result_envelope async and threading .await through the caller.


🟡 Minor — depth comment at harvest→enrich handoff is misleading
packages/pipeline/src/worker.rs — the comment at depth: payload.depth:

// Inherit the just-completed harvest's depth so a
// related-legislation harvest chain keeps counting.
depth: payload.depth,

payload here is a HarvestPayload whose depth is the extref recursion counter, not the related-legislation chain counter. The comment implies only related chains inherit this value. In reality, any harvest at extref depth ≥ related_harvest_max_depth (2 by default) will have its enrichment silently skip related-legislation follow-ups — the PR description acknowledges this as a known limitation but the comment in code doesn't signal it. Worth expanding:

// Inherit the harvest's depth (extref OR related-chain). This
// conflates both depth counters — a law reached via ≥2 extref hops
// won't get its related legislation harvested. Clean-up: a separate
//  field on HarvestPayload.
depth: payload.depth,

Rest of the diff

  • bwb_search.rs extraction of search_bwb_by_name: clean refactor, error mapping is correct.
  • config.rs / env parsing: parse().ok() silently falls back to 2 on any bad value — appropriate.
  • enrich.rs envelope structs: serde defaults are correct; graceful degradation (absent/malformed → empty) is correct.
  • .pre-commit-config.yaml / validate.sh: exclusion regexes are correct and consistent.
  • SKILL.md Phase 4.5: clear instructions; the CRITICAL warning not to put the envelope in the law YAML is exactly right.
  • Unit tests cover serialization, absent/malformed envelopes, clamping, and slug normalization. Good coverage.

No critical or significant issues.

@claude claude Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pipeline: harvest related legislation discovered during enrichment

Correctness

🟡 Minor — SRU result not validated as BWB ID (inconsistency with paths a and b)
packages/pipeline/src/worker.rsresolve_related_bwb_id guards path (a) and (b) with is_valid_bwb_id, but path (c) returns results[0].bwb_id.clone() directly:

Ok(results) if results.len() == 1 => RelatedResolution::Resolved(results[0].bwb_id.clone()),

BwbSearchResult.bwb_id comes from parse_sru_response parsing wetten.overheid.nl SRU output — in practice always a BWBR ID — but the inconsistency means a malformed API response propagates an invalid ID into HarvestPayload.bwb_id. Since the harvest is best-effort this fails silently; easy fix:

Ok(results) if results.len() == 1 && is_valid_bwb_id(&results[0].bwb_id) => {
    RelatedResolution::Resolved(results[0].bwb_id.clone())
}

🟡 Minor — blocking I/O in async context
packages/pipeline/src/enrich.rs:438read_enrichment_result_envelope calls std::fs::read_to_string, blocking the Tokio runtime thread. Every other file operation in execute_enrich_with_runner uses tokio::fs. For a tiny sidecar the impact is negligible, but it's inconsistent with the surrounding code.


🟡 Minor — depth comment at harvest→enrich handoff understates the conflation
packages/pipeline/src/worker.rs:

// Inherit the just-completed harvest's depth so a
// related-legislation harvest chain keeps counting.
depth: payload.depth,

payload here is a HarvestPayload whose depth is the extref recursion counter. The comment implies only related chains inherit this value. In reality any law reached via ≥ RELATED_HARVEST_MAX_DEPTH extref hops will silently have its related-legislation follow-ups skipped. The PR description documents this as a known limitation; the in-code comment should say so too.


Rest of the diff

  • bwb_search.rs extraction of search_bwb_by_name: clean refactor, error mapping is correct.
  • config.rs env parsing: parse().ok() silently falls back to 2 on bad input — appropriate for a tuning knob.
  • enrich.rs envelope structs: serde defaults are correct; graceful degradation (absent/malformed → empty) is correct.
  • .pre-commit-config.yaml / validate.sh: exclusion patterns are correct and consistent with each other.
  • SKILL.md Phase 4.5: clear instructions; the CRITICAL warning not to put the envelope inside the law YAML is exactly right.
  • Unit tests cover serialization, absent/malformed envelopes, priority clamping, and slug normalization — good coverage.

No critical or significant issues.

- Validate the single-hit SRU result as a BWB id before resolving, so a
  malformed SRU id can't slip into a harvest payload (paths a/b already validate).
- Read the .enrichment-result.yaml sidecar via tokio::fs (was blocking std::fs)
  for consistency with the rest of execute_enrich_with_runner.
- Clarify the depth-inherit comment: the field is the shared extref-recursion
  counter, so deep-via-extref laws skip related discovery (roots/shallow laws,
  the intended case, are unaffected).
@github-actions github-actions Bot deleted a comment from claude Bot Jul 2, 2026
@github-actions github-actions Bot deleted a comment from claude Bot Jul 2, 2026
@github-actions github-actions Bot deleted a comment from claude Bot Jul 2, 2026
@tdjager tdjager merged commit d2992e9 into main Jul 2, 2026
30 checks passed
@tdjager tdjager deleted the feat/enrich-related-legislation branch July 2, 2026 20:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant