Merged
Changes from 3 commits
38 changes: 38 additions & 0 deletions .claude/skills/law-generate/SKILL.md
@@ -763,6 +763,44 @@ The JSON payload format (written to the temp file):
laws (>20 articles), this limit applies per batch — each batch of ~15 articles
gets its own 3-iteration budget

## Phase 4.5: Write the related-legislation result envelope

After the `machine_readable` sections are final, write a **sibling result
envelope** so the pipeline can auto-harvest the legislation this law depends on
(delegated regelingen and cross-law references the extref-only harvester misses).

Write it next to the law YAML as `.enrichment-result.yaml` (same directory,
e.g. `corpus/regulation/nl/wet/wet_op_de_zorgtoeslag/.enrichment-result.yaml`).
Use the `Write` tool — no new agent tools are needed.

```yaml
# .enrichment-result.yaml — result envelope, NOT part of the law schema
law_id: wet_op_de_zorgtoeslag
related_legislation:
- name: Regeling vaststelling standaardpremie en bestuursrechtelijke premie
relation: delegated_regeling # source_regulation | legal_basis | delegated_regeling
bwb_id: BWBR0037841 # optional, best-effort
slug: regeling_standaardpremie # optional, best-effort
open_term: standaardpremie # optional, only for delegations
- name: Algemene wet inkomensafhankelijke regelingen
relation: source_regulation
```

Coverage: add one entry for **every** `source.regulation` you bound, every
`legal_basis` you anchored on, and every `open_term` delegation you declared.
Fields:

- `name` — **required**; the human-readable law/regeling title (used for search
fallback when no id/slug is given).
- `relation` — one of `source_regulation`, `legal_basis`, `delegated_regeling`.
- `bwb_id`, `slug`, `open_term` — **optional**, best-effort. Supply what you know
(a known `bwb_id` resolves fastest); leave the rest out.

**CRITICAL — this MUST NOT go in the law YAML.** The law file stays strictly
schema-conformant (`just validate` must still pass). The related-legislation list
lives only in the `.enrichment-result.yaml` sidecar, which the pipeline reads
separately. Do not add a `related_legislation:` key anywhere inside the law YAML.

## Phase 5: Report

Report to the user:
3 changes: 3 additions & 0 deletions .pre-commit-config.yaml
@@ -59,6 +59,9 @@ repos:
language: system
pass_filenames: true
files: ^corpus/regulation/.*\.yaml$
# Skip dot-prefixed sidecars (.enrichment.yaml, .enrichment-result.yaml):
# enrichment metadata/result envelopes, not law files.
exclude: (^|/)\.[^/]*\.yaml$
types: [yaml]

- id: skills-no-casus
130 changes: 130 additions & 0 deletions docs/src/content/rfcs/rfc-025.md
@@ -0,0 +1,130 @@
---
title: "RFC-025: Related-Legislation Discovery via an Enrichment Result Envelope"
status: Proposed
implementation: Partially implemented
date: '2026-07-02'
authors:
- Tim de Jager
depends_on:
- RFC-007 (Cross-Law Execution Model)
- RFC-010 (Federated Corpus)
short_title: Related-Legislation Discovery
---

## Summary

The recursive harvester follows only the **explicit BWB cross-links present in
the source text** (extrefs). That misses the legislation a machine-readable
model actually depends on but the text does not hyperlink: delegated regelingen
(e.g. `regeling_standaardpremie`, filled in via the `open_term`/`implements`
IoC pattern), cross-law `source.regulation` bindings whose target is named in
prose, and `legal_basis` anchors. The dependency graph therefore never fills
itself in for exactly the laws that matter.

This RFC introduces a small feedback loop: **after enrichment, the agent
declares the related legislation it just modeled, and the worker enqueues
follow-up harvests for it.** Each harvested law is in turn enriched, which
declares its own dependencies, and so on — the graph completes itself via
recursion.

## The result envelope (why it lives outside the law schema)

The enrichment agent returns related legislation in a **result envelope
sidecar**, written next to the enriched law YAML as `.enrichment-result.yaml`:

```yaml
law_id: wet_op_de_zorgtoeslag
related_legislation:
  - name: Regeling vaststelling standaardpremie en bestuursrechtelijke premie
    relation: delegated_regeling    # source_regulation | legal_basis | delegated_regeling
    bwb_id: BWBR0037841             # optional, best-effort
    slug: regeling_standaardpremie  # optional, best-effort
    open_term: standaardpremie      # optional, delegations only
  - name: Algemene wet inkomensafhankelijke regelingen
    relation: source_regulation
```

This is deliberately **not** a law-schema change. The law YAML must stay
strictly schema-conformant (`just validate` is the contract), and routing
metadata like "which other laws should the pipeline go fetch" is an artefact of
the enrichment *process*, not of the law itself. Putting it in a sidecar keeps
the two concerns separate: the law model is validated against the schema, and
the envelope is read by the pipeline separately and never validated as a law.
The envelope is staged alongside the law as provenance (it records what the
agent believed the dependencies were at enrichment time), but a malformed or
absent envelope can never fail an otherwise-successful enrichment — it degrades
to "no related legislation".

## Worker hook, depth, and priority

When a completed enrichment is committed, the worker resolves each declared entry
to a BWB id and enqueues a follow-up harvest.

**Depth** propagates so the recursion is bounded and observable:

- A harvest job carries `depth` (already used by the existing extref follow-up
loop). When that harvest auto-enqueues its enrichment, the enrichment
**inherits** the harvest's depth.
- The follow-up harvests an enrichment spawns get `depth + 1`.
- Admin-requested enrichments are roots (`depth = None`, treated as depth 0).

**Priority** decreases by one point per nesting level
(`RELATED_HARVEST_BASE - (depth + 1)`, base 40, clamped to `0..=100`), so a
deep, speculative related-harvest always yields to shallower harvests and to
editor-/root-requested harvests.
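
The depth and priority rules above can be sketched as two small functions. This is an illustration under stated assumptions, not the worker's actual code: the function names are hypothetical, and `depth` in the formula is read as the depth of the enrichment that spawns the follow-up harvest.

```rust
/// Base priority for related-legislation follow-up harvests (40 per the text above).
const RELATED_HARVEST_BASE: i64 = 40;

/// Depth given to a follow-up harvest spawned by an enrichment.
/// `None` marks a root (admin-requested) enrichment and is treated as depth 0.
/// Hypothetical helper name; only the `depth + 1` rule comes from the RFC.
fn follow_up_depth(enrich_depth: Option<u32>) -> u32 {
    enrich_depth.unwrap_or(0).saturating_add(1)
}

/// Priority of that follow-up harvest: one point lower per nesting level,
/// clamped to the `0..=100` range.
fn related_harvest_priority(enrich_depth: Option<u32>) -> i32 {
    let depth = i64::from(enrich_depth.unwrap_or(0));
    (RELATED_HARVEST_BASE - (depth + 1)).clamp(0, 100) as i32
}
```

Under these assumptions a root enrichment enqueues its follow-ups at depth 1 with priority 39, and an enrichment at depth 1 enqueues at depth 2 with priority 38.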

## Hybrid resolution

A declared entry is resolved to a BWB id in order, stopping at the first hit:

1. an explicit `bwb_id` matching `^BWBR\d{7}$`;
2. a slug lookup against `law_entries` (`slug`, else a slugified `name`);
3. an SRU title search by `name` — accepted **only** when it returns exactly
one law. More than one hit is logged as `needs_confirmation` and skipped (a
human decides); zero hits or an error is skipped.

The worker emits a single summary log per enrichment with the
total/resolved/enqueued/needs_confirmation/unresolved counts.
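
The resolution order above can be sketched as follows. All type and function names here are hypothetical, the two lookups are passed in as closures instead of the real database and SRU calls, and `slugify` is a stand-in whose exact rules are an assumption. The sketch also assumes that an invalid `bwb_id` falls through to the next step and that the slugified `name` is used only when no `slug` was declared.

```rust
/// One declared envelope entry (only the fields used for resolution).
struct RelatedEntry {
    name: String,
    bwb_id: Option<String>,
    slug: Option<String>,
}

#[derive(Debug, PartialEq)]
enum Resolution {
    Resolved(String),
    NeedsConfirmation,
    Unresolved,
}

/// Hand-rolled check equivalent to `^BWBR\d{7}$` for ASCII digits.
fn is_bwb_id(s: &str) -> bool {
    s.strip_prefix("BWBR")
        .map_or(false, |d| d.len() == 7 && d.bytes().all(|b| b.is_ascii_digit()))
}

/// Hypothetical slugifier: lowercase ASCII alphanumerics joined by underscores.
fn slugify(name: &str) -> String {
    let mut out = String::new();
    for c in name.chars() {
        if c.is_ascii_alphanumeric() {
            out.push(c.to_ascii_lowercase());
        } else if !out.is_empty() && !out.ends_with('_') {
            out.push('_');
        }
    }
    out.trim_end_matches('_').to_string()
}

/// Try each strategy in order and stop at the first hit.
fn resolve(
    entry: &RelatedEntry,
    slug_lookup: impl Fn(&str) -> Option<String>,
    title_search: impl Fn(&str) -> Result<Vec<String>, String>,
) -> Resolution {
    // 1. Explicit, well-formed BWB id.
    if let Some(id) = entry.bwb_id.as_deref().filter(|id| is_bwb_id(id)) {
        return Resolution::Resolved(id.to_string());
    }
    // 2. Slug lookup: declared slug, else a slugified name.
    let slug = entry.slug.clone().unwrap_or_else(|| slugify(&entry.name));
    if let Some(id) = slug_lookup(&slug) {
        return Resolution::Resolved(id);
    }
    // 3. Title search: accept exactly one hit, defer on several, skip otherwise.
    match title_search(&entry.name) {
        Ok(hits) if hits.len() == 1 => Resolution::Resolved(hits[0].clone()),
        Ok(hits) if hits.len() > 1 => Resolution::NeedsConfirmation,
        _ => Resolution::Unresolved,
    }
}
```

Passing the lookups as closures keeps the ordering logic testable without a database or network; the per-enrichment summary counts could then be tallied from the returned `Resolution` values.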

## Guards

The loop is always on but designed to be safe to leave on:

- **Cheap by itself**: it only enqueues *harvest* jobs (no LLM). The LLM-costly
re-enrichment of a newly harvested law is separately gated by
`ENRICH_AUTO_ENQUEUE` and bounded by `ENRICH_DAILY_LIMIT`, so the expensive part
of the recursion never runs unless enrichment auto-enqueue is explicitly on.
- **Depth cap**: `RELATED_HARVEST_MAX_DEPTH` (default 2) stops the recursion.
- **Daily enrich cap**: `ENRICH_DAILY_LIMIT` continues to bound how many
enrichments (and therefore how much LLM spend) run per day, independent of how
many harvests the loop enqueues.
- **Dedup**: `create_harvest_job_if_not_exists` prevents duplicate queued jobs;
`harvest_exhausted` laws are skipped; version resolution/shadowing (RFC-010)
keeps re-harvests from multiplying the corpus.
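
As a rough sketch of the depth-cap guard: the function names are hypothetical, and the strict `<` comparison is inferred from the "at/above the cap" wording under Known limitations, not copied from the worker.

```rust
/// Default for `RELATED_HARVEST_MAX_DEPTH`, as stated above.
const DEFAULT_RELATED_HARVEST_MAX_DEPTH: u32 = 2;

/// Parse the cap from an optional raw environment value, falling back to the
/// default when the variable is unset or not a valid `u32`.
fn parse_max_depth(raw: Option<&str>) -> u32 {
    raw.and_then(|v| v.parse().ok())
        .unwrap_or(DEFAULT_RELATED_HARVEST_MAX_DEPTH)
}

/// Whether an enrichment at `depth` may still enqueue related harvests.
/// A root enrichment (`None`) counts as depth 0.
fn may_enqueue_related(depth: Option<u32>, max_depth: u32) -> bool {
    depth.unwrap_or(0) < max_depth
}
```

With the default cap of 2, enrichments at depth 0 and 1 may enqueue follow-ups (at depth 1 and 2 respectively), while an enrichment already at depth 2 does not.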

## Known limitations

- **Shared depth counter.** The related-harvest recursion reuses
`HarvestPayload.depth`, the same field the extref recursive harvester
increments. A law reached via ≥ `RELATED_HARVEST_MAX_DEPTH` extref hops
therefore arrives at enrichment already at/above the cap, so related-legislation
discovery is skipped for it — even if that is the *first* such opportunity for
that law. In practice the feature fires for roots and shallow laws (the intended
case); deep-in-a-chain laws usually already have their references harvested via
extref anyway. A dedicated `related_depth` field would lift this restriction and
is the clean follow-up.
- **Unambiguous-but-loose SRU match.** A single SRU title hit is accepted as-is;
there is no similarity threshold, so a loosely-matching unique result could
enqueue the wrong law. Impact is low — the follow-up is harvest-only (no LLM),
and its output lands on a reviewable `enrich/{provider}` branch — but it is why `>1`
hits deliberately degrade to `needs_confirmation`/skip rather than guessing.

## Implementation status

Implemented: the envelope types and sidecar read, `EnrichPayload.depth` with
propagation across all three enqueue sites, the SRU search extracted into a
client-taking function shared with the API handler, the always-on worker hook with
hybrid resolution and summary logging, the `RELATED_HARVEST_MAX_DEPTH` cap, and the
enrichment-skill step that writes the envelope. Not yet built: a UI/admin surface for the
`needs_confirmation` cases (currently only logged), and a build-time slug/name
index to make resolution independent of harvest order.
2 changes: 2 additions & 0 deletions packages/admin/src/handlers.rs
@@ -644,6 +644,8 @@ pub async fn create_enrich_jobs(
law_id: law_id.clone(),
yaml_path: yaml_path.clone(),
provider: Some((*provider_name).to_string()),
// Admin-requested enrichments are roots of the related-harvest chain.
depth: None,
};

let payload_json = serde_json::to_value(&enrich_payload).map_err(|e| {
58 changes: 29 additions & 29 deletions packages/pipeline/src/api/bwb_search.rs
@@ -29,9 +29,24 @@ pub async fn search_bwb(
     State(state): State<ApiState>,
     Query(params): Query<SearchParams>,
 ) -> Result<Json<Vec<BwbSearchResult>>, (StatusCode, String)> {
-    let q = params.q.trim();
-    if q.is_empty() || q.len() < 3 {
-        return Ok(Json(vec![]));
+    match search_bwb_by_name(&state.http_client, params.q.trim()).await {
+        Ok(results) => Ok(Json(results)),
+        Err(e) => Err((StatusCode::BAD_GATEWAY, e)),
+    }
+}
+
+/// Search wetten.overheid.nl via the SRU API for laws matching `q`.
+///
+/// The client-taking core shared by the axum handler and the enrich worker's
+/// related-legislation resolution. Queries shorter than 3 characters (after the
+/// same sanitize as the handler) return an empty list rather than an error.
+pub async fn search_bwb_by_name(
+    client: &reqwest::Client,
+    q: &str,
+) -> Result<Vec<BwbSearchResult>, String> {
+    let q = q.trim();
+    if q.len() < 3 {
+        return Ok(vec![]);
     }

     let sanitized: String = q
@@ -50,35 +65,20 @@ pub async fn search_bwb(
             ("maximumRecords", &MAX_RESULTS.to_string()),
         ],
     )
-    .map_err(|e| {
-        (
-            StatusCode::INTERNAL_SERVER_ERROR,
-            format!("URL build error: {e}"),
-        )
-    })?;
-
-    let response = state
-        .http_client
+    .map_err(|e| format!("URL build error: {e}"))?;
+
+    let response = client
         .get(url)
         .send()
         .await
-        .map_err(|e| (StatusCode::BAD_GATEWAY, format!("BWB search failed: {e}")))?;
-
-    let xml_text = response.text().await.map_err(|e| {
-        (
-            StatusCode::BAD_GATEWAY,
-            format!("BWB response read failed: {e}"),
-        )
-    })?;
-
-    let results = parse_sru_response(&xml_text).map_err(|e| {
-        (
-            StatusCode::INTERNAL_SERVER_ERROR,
-            format!("XML parse error: {e}"),
-        )
-    })?;
-
-    Ok(Json(results))
+        .map_err(|e| format!("BWB search failed: {e}"))?;
+
+    let xml_text = response
+        .text()
+        .await
+        .map_err(|e| format!("BWB response read failed: {e}"))?;
+
+    parse_sru_response(&xml_text).map_err(|e| format!("XML parse error: {e}"))
 }

 /// Parse SRU XML response and extract unique laws (deduplicated by BWBR ID).
2 changes: 1 addition & 1 deletion packages/pipeline/src/api/harvest.rs
@@ -155,7 +155,7 @@ async fn resolve_identifiers(
 }

 /// Find a law's BWB ID by its slug in the law_entries table.
-async fn find_bwb_id_by_slug(
+pub async fn find_bwb_id_by_slug(
     pool: &sqlx::PgPool,
     slug: &str,
 ) -> Result<Option<String>, sqlx::Error> {
12 changes: 12 additions & 0 deletions packages/pipeline/src/config.rs
@@ -97,6 +97,11 @@ pub struct WorkerConfig {
/// Off by default; enrichment is otherwise requested explicitly via the admin
/// API. Configurable via `ENRICH_AUTO_ENQUEUE`.
pub auto_enrich_enqueue: bool,
/// Maximum recursion depth for related-legislation follow-up harvests.
/// A depth-0 enrichment may enqueue harvests at depth 1, whose enrichments
/// may enqueue at depth 2, etc., up to this cap. Default: 2. Configurable
/// via `RELATED_HARVEST_MAX_DEPTH`.
pub related_harvest_max_depth: u32,
}

impl std::fmt::Debug for WorkerConfig {
@@ -118,6 +123,7 @@ impl std::fmt::Debug for WorkerConfig {
)
.field("enrich_daily_limit", &self.enrich_daily_limit)
.field("auto_enrich_enqueue", &self.auto_enrich_enqueue)
.field("related_harvest_max_depth", &self.related_harvest_max_depth)
.finish()
}
}
@@ -193,6 +199,11 @@ impl WorkerConfig {
})
.unwrap_or(false);

let related_harvest_max_depth: u32 = std::env::var("RELATED_HARVEST_MAX_DEPTH")
.ok()
.and_then(|v| v.parse().ok())
.unwrap_or(2);

Ok(Self {
database_url,
max_connections,
@@ -207,6 +218,7 @@ impl WorkerConfig {
max_consecutive_resource_failures,
enrich_daily_limit,
auto_enrich_enqueue,
related_harvest_max_depth,
})
}
