Commit d2992e9

feat(pipeline): harvest related legislation discovered during enrichment (#901)
* feat(pipeline): harvest related legislation discovered during enrichment

  Regelingen reached only through the machine-readable model (source.regulation, legal_basis, and open_terms/implements delegations like regeling_standaardpremie) are never auto-harvested: the recursive harvester follows only <extref> BWB hyperlinks in the source text, and those model links are a product of enrichment.

  The enrichment agent now returns the related legislation it needs in a result envelope (.enrichment-result.yaml), deliberately kept OUT of the law YAML so the law artifact stays schema-conformant. After a successful enrich the worker reads the envelope, resolves each entry to a BWB id (agent bwb_id -> law_entries slug -> single-hit SRU title search), and enqueues a follow-up harvest for each. Newly harvested laws auto-enrich and return their own related legislation, so the dependency graph fills itself by recursion.

  - EnrichPayload.depth carries recursion depth (harvest -> enrich -> harvest); each related harvest is depth+1 at priority 40-(depth+1), so deeper nesting yields to shallower/interactive harvests.
  - Opt-in: HARVEST_RELATED_LEGISLATION (off by default) + RELATED_HARVEST_MAX_DEPTH (default 2); reuses ENRICH_DAILY_LIMIT for spend and create_harvest_job_if_not_exists for dedup. Best-effort: nothing here can fail the already-committed enrichment.
  - search_bwb_by_name extracted from the axum handler for reuse; find_bwb_id_by_slug made pub; slug hits re-validated as BWB (CVDR skipped).

  RFC-025 documents the pattern and its known limitations.

* refactor(pipeline): make related-legislation harvest always-on

  Drop the HARVEST_RELATED_LEGISLATION opt-in flag. The follow-up harvest only enqueues harvest jobs (no LLM cost); the expensive re-enrichment of those laws stays gated by ENRICH_AUTO_ENQUEUE + ENRICH_DAILY_LIMIT, and the recursion is bounded by RELATED_HARVEST_MAX_DEPTH. So there is nothing to protect behind a separate flag.

* fix(dev): exclude enrichment sidecars from law YAML validation

  The enrichment result envelope (.enrichment-result.yaml) and the existing .enrichment.yaml metadata sidecar are written into a law directory but are not law files. `find -name '*.yaml'` matches leading-dot names and the pre-commit `files:` regex matches them too, so validating one fails (missing $id). Skip dot-prefixed sidecars in both script/validate.sh and the validate-law-yaml hook.

* fix(pipeline): tighten related-legislation resolution; drop RFC-025

  Address CI review findings:

  - A CVDR slug hit no longer falls through to the SRU name search (the slug already identified the law; a title match could resolve a *different* national law). It now returns Unresolved.
  - The harvest summary log separates already_queued and exhausted skips instead of conflating them in the resolved-but-not-enqueued gap.

  Also drop RFC-025: the related-legislation harvest loop is an implementation detail, not a cross-cutting design decision that warrants an RFC.

* fix(pipeline): address review nits on related-legislation resolution

  - Validate the single-hit SRU result as a BWB id before resolving, so a malformed SRU id can't slip into a harvest payload (paths a/b already validate).
  - Read the .enrichment-result.yaml sidecar via tokio::fs (was blocking std::fs) for consistency with the rest of execute_enrich_with_runner.
  - Clarify the depth-inherit comment: the field is the shared extref-recursion counter, so deep-via-extref laws skip related discovery (roots/shallow laws, the intended case, are unaffected).
1 parent 454b25c commit d2992e9

10 files changed

Lines changed: 577 additions & 35 deletions

.claude/skills/law-generate/SKILL.md

Lines changed: 38 additions & 0 deletions
@@ -763,6 +763,44 @@ The JSON payload format (written to the temp file):
 laws (>20 articles), this limit applies per batch — each batch of ~15 articles
 gets its own 3-iteration budget
 
+## Phase 4.5: Write the related-legislation result envelope
+
+After the `machine_readable` sections are final, write a **sibling result
+envelope** so the pipeline can auto-harvest the legislation this law depends on
+(delegated regelingen and cross-law references the extref-only harvester misses).
+
+Write it next to the law YAML as `.enrichment-result.yaml` (same directory,
+e.g. `corpus/regulation/nl/wet/wet_op_de_zorgtoeslag/.enrichment-result.yaml`).
+Use the `Write` tool — no new agent tools are needed.
+
+```yaml
+# .enrichment-result.yaml — result envelope, NOT part of the law schema
+law_id: wet_op_de_zorgtoeslag
+related_legislation:
+  - name: Regeling vaststelling standaardpremie en bestuursrechtelijke premie
+    relation: delegated_regeling  # source_regulation | legal_basis | delegated_regeling
+    bwb_id: BWBR0037841  # optional, best-effort
+    slug: regeling_standaardpremie  # optional, best-effort
+    open_term: standaardpremie  # optional, only for delegations
+  - name: Algemene wet inkomensafhankelijke regelingen
+    relation: source_regulation
+```
+
+Coverage: add one entry for **every** `source.regulation` you bound, every
+`legal_basis` you anchored on, and every `open_term` delegation you declared.
+Fields:
+
+- `name` — **required**; the human-readable law/regeling title (used for search
+  fallback when no id/slug is given).
+- `relation` — one of `source_regulation`, `legal_basis`, `delegated_regeling`.
+- `bwb_id`, `slug`, `open_term` — **optional**, best-effort. Supply what you know
+  (a known `bwb_id` resolves fastest); leave the rest out.
+
+**CRITICAL — this MUST NOT go in the law YAML.** The law file stays strictly
+schema-conformant (`just validate` must still pass). The related-legislation list
+lives only in the `.enrichment-result.yaml` sidecar, which the pipeline reads
+separately. Do not add a `related_legislation:` key anywhere inside the law YAML.
+
 ## Phase 5: Report
 
 Report to the user:
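
The commit message describes how the worker turns each envelope entry into a BWB id: an agent-supplied `bwb_id` first, then a `law_entries` slug lookup, then a single-hit SRU title search. The sketch below illustrates that order under stated assumptions and is not the worker's code. `RelatedEntry`, `Resolution`, `is_bwb_id`, `resolve`, and the two closure parameters are hypothetical stand-ins for the database and SRU lookups. The id shape checked here (`BWB`, one uppercase letter, seven digits) and the choice to ignore a malformed agent id are also assumptions.

```rust
/// Outcome of resolving one related-legislation entry (hypothetical type).
#[derive(Debug, PartialEq)]
enum Resolution {
    Resolved(String),
    Unresolved,
}

/// One envelope entry, reduced to the fields that matter for resolution.
struct RelatedEntry {
    name: String,
    bwb_id: Option<String>,
    slug: Option<String>,
}

/// Assumed id shape: "BWB", one uppercase letter, seven digits (e.g. BWBR0037841).
fn is_bwb_id(id: &str) -> bool {
    let bytes = id.as_bytes();
    bytes.len() == 11
        && id.starts_with("BWB")
        && bytes[3].is_ascii_uppercase()
        && bytes[4..].iter().all(u8::is_ascii_digit)
}

/// Resolve an entry: agent id, then slug lookup, then single-hit name search.
fn resolve(
    entry: &RelatedEntry,
    slug_lookup: impl Fn(&str) -> Option<String>,
    name_search: impl Fn(&str) -> Vec<String>,
) -> Resolution {
    // (a) An id supplied by the agent wins when it looks like a BWB id.
    // Assumption: a malformed id is ignored and resolution continues.
    if let Some(id) = entry.bwb_id.as_deref().filter(|id| is_bwb_id(id)) {
        return Resolution::Resolved(id.to_string());
    }

    // (b) A slug hit is final. A non-BWB hit (e.g. CVDR) stops here instead of
    // falling through to the title search, which could match a different law.
    if let Some(id) = entry.slug.as_deref().and_then(|slug| slug_lookup(slug)) {
        return if is_bwb_id(&id) {
            Resolution::Resolved(id)
        } else {
            Resolution::Unresolved
        };
    }

    // (c) A title search resolves only on exactly one valid hit.
    let hits = name_search(&entry.name);
    match hits.as_slice() {
        [only] if is_bwb_id(only) => Resolution::Resolved(only.clone()),
        _ => Resolution::Unresolved,
    }
}
```

Passing the lookups in as closures keeps the ordering testable without a database or network; in the real worker these would be async calls.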

.pre-commit-config.yaml

Lines changed: 3 additions & 0 deletions
@@ -59,6 +59,9 @@ repos:
         language: system
         pass_filenames: true
         files: ^corpus/regulation/.*\.yaml$
+        # Skip dot-prefixed sidecars (.enrichment.yaml, .enrichment-result.yaml):
+        # enrichment metadata/result envelopes, not law files.
+        exclude: (^|/)\.[^/]*\.yaml$
         types: [yaml]
 
       - id: skills-no-casus
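
To make the new `exclude` pattern concrete: `(^|/)\.[^/]*\.yaml$` matches any path whose final component starts with a dot and ends in `.yaml`. The sketch below is a hand-written equivalent in plain Rust, offered only to illustrate which paths are skipped. The function name `is_dot_sidecar` is hypothetical; the hook itself uses the regex.

```rust
/// Returns true when the regex `(^|/)\.[^/]*\.yaml$` would match `path`.
///
/// Neither `[^/]*` nor `.yaml` can contain a slash, so the match must sit
/// entirely inside the last path component: a leading dot, any run of
/// non-slash characters, then the `.yaml` suffix.
fn is_dot_sidecar(path: &str) -> bool {
    // `rsplit` always yields at least one item, so the fallback is never used.
    let file_name = path.rsplit('/').next().unwrap_or(path);
    match file_name.strip_prefix('.') {
        // The suffix must come after the leading dot, so a bare ".yaml" fails.
        Some(rest) => rest.ends_with(".yaml"),
        None => false,
    }
}
```

Under this reading, `.enrichment.yaml` and `.enrichment-result.yaml` are excluded in any directory, while a YAML file inside a dot-prefixed directory is still validated because only the file name is inspected.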

packages/admin/src/handlers.rs

Lines changed: 2 additions & 0 deletions
@@ -644,6 +644,8 @@ pub async fn create_enrich_jobs(
                 law_id: law_id.clone(),
                 yaml_path: yaml_path.clone(),
                 provider: Some((*provider_name).to_string()),
+                // Admin-requested enrichments are roots of the related-harvest chain.
+                depth: None,
             };
 
             let payload_json = serde_json::to_value(&enrich_payload).map_err(|e| {

packages/pipeline/src/api/bwb_search.rs

Lines changed: 29 additions & 29 deletions
@@ -29,9 +29,24 @@ pub async fn search_bwb(
     State(state): State<ApiState>,
     Query(params): Query<SearchParams>,
 ) -> Result<Json<Vec<BwbSearchResult>>, (StatusCode, String)> {
-    let q = params.q.trim();
-    if q.is_empty() || q.len() < 3 {
-        return Ok(Json(vec![]));
+    match search_bwb_by_name(&state.http_client, params.q.trim()).await {
+        Ok(results) => Ok(Json(results)),
+        Err(e) => Err((StatusCode::BAD_GATEWAY, e)),
+    }
+}
+
+/// Search wetten.overheid.nl via the SRU API for laws matching `q`.
+///
+/// The client-taking core shared by the axum handler and the enrich worker's
+/// related-legislation resolution. Queries shorter than 3 characters (after the
+/// same sanitize as the handler) return an empty list rather than an error.
+pub async fn search_bwb_by_name(
+    client: &reqwest::Client,
+    q: &str,
+) -> Result<Vec<BwbSearchResult>, String> {
+    let q = q.trim();
+    if q.len() < 3 {
+        return Ok(vec![]);
     }
 
     let sanitized: String = q
@@ -50,35 +65,20 @@ pub async fn search_bwb(
             ("maximumRecords", &MAX_RESULTS.to_string()),
         ],
     )
-    .map_err(|e| {
-        (
-            StatusCode::INTERNAL_SERVER_ERROR,
-            format!("URL build error: {e}"),
-        )
-    })?;
-
-    let response = state
-        .http_client
+    .map_err(|e| format!("URL build error: {e}"))?;
+
+    let response = client
         .get(url)
         .send()
         .await
-        .map_err(|e| (StatusCode::BAD_GATEWAY, format!("BWB search failed: {e}")))?;
-
-    let xml_text = response.text().await.map_err(|e| {
-        (
-            StatusCode::BAD_GATEWAY,
-            format!("BWB response read failed: {e}"),
-        )
-    })?;
-
-    let results = parse_sru_response(&xml_text).map_err(|e| {
-        (
-            StatusCode::INTERNAL_SERVER_ERROR,
-            format!("XML parse error: {e}"),
-        )
-    })?;
-
-    Ok(Json(results))
+        .map_err(|e| format!("BWB search failed: {e}"))?;
+
+    let xml_text = response
+        .text()
+        .await
+        .map_err(|e| format!("BWB response read failed: {e}"))?;
+
+    parse_sru_response(&xml_text).map_err(|e| format!("XML parse error: {e}"))
 }
 
 /// Parse SRU XML response and extract unique laws (deduplicated by BWBR ID).
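
The refactor in this file separates a thin HTTP handler from a reusable core that returns `Result<_, String>`. One visible consequence is that every core error now surfaces as `BAD_GATEWAY`, where the old handler returned `INTERNAL_SERVER_ERROR` for URL-build and XML-parse failures. The dependency-free sketch below shows that shape only. The `Status` enum, the `fetch` closure (standing in for the SRU request), and both function names are hypothetical, and the real functions are async.

```rust
/// Hypothetical stand-in for an HTTP status code type.
#[derive(Debug, PartialEq)]
enum Status {
    BadGateway,
}

/// Reusable core: trims the query, short-circuits on queries under 3 bytes,
/// and reports failures as plain strings so non-HTTP callers can use it.
fn search_by_name(
    fetch: impl Fn(&str) -> Result<Vec<String>, String>,
    q: &str,
) -> Result<Vec<String>, String> {
    let q = q.trim();
    if q.len() < 3 {
        // Too short to search: an empty result, not an error.
        return Ok(vec![]);
    }
    fetch(q)
}

/// Thin handler: delegates to the core and maps any error to one status.
fn search_handler(
    fetch: impl Fn(&str) -> Result<Vec<String>, String>,
    raw_q: &str,
) -> Result<Vec<String>, (Status, String)> {
    search_by_name(fetch, raw_q).map_err(|e| (Status::BadGateway, e))
}
```

Keeping the error type a plain `String` in the core is what lets the enrich worker call it without depending on axum types, at the cost of losing the per-failure status distinction in the handler.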

packages/pipeline/src/api/harvest.rs

Lines changed: 1 addition & 1 deletion
@@ -155,7 +155,7 @@ async fn resolve_identifiers(
 }
 
 /// Find a law's BWB ID by its slug in the law_entries table.
-async fn find_bwb_id_by_slug(
+pub async fn find_bwb_id_by_slug(
     pool: &sqlx::PgPool,
     slug: &str,
 ) -> Result<Option<String>, sqlx::Error> {

packages/pipeline/src/config.rs

Lines changed: 12 additions & 0 deletions
@@ -97,6 +97,11 @@ pub struct WorkerConfig {
     /// Off by default; enrichment is otherwise requested explicitly via the admin
     /// API. Configurable via `ENRICH_AUTO_ENQUEUE`.
     pub auto_enrich_enqueue: bool,
+    /// Maximum recursion depth for related-legislation follow-up harvests.
+    /// A depth-0 enrichment may enqueue harvests at depth 1, whose enrichments
+    /// may enqueue at depth 2, etc., up to this cap. Default: 2. Configurable
+    /// via `RELATED_HARVEST_MAX_DEPTH`.
+    pub related_harvest_max_depth: u32,
 }
 
 impl std::fmt::Debug for WorkerConfig {
@@ -118,6 +123,7 @@ impl std::fmt::Debug for WorkerConfig {
             )
             .field("enrich_daily_limit", &self.enrich_daily_limit)
             .field("auto_enrich_enqueue", &self.auto_enrich_enqueue)
+            .field("related_harvest_max_depth", &self.related_harvest_max_depth)
             .finish()
     }
 }
@@ -193,6 +199,11 @@ impl WorkerConfig {
             })
             .unwrap_or(false);
 
+        let related_harvest_max_depth: u32 = std::env::var("RELATED_HARVEST_MAX_DEPTH")
+            .ok()
+            .and_then(|v| v.parse().ok())
+            .unwrap_or(2);
+
         Ok(Self {
             database_url,
             max_connections,
@@ -207,6 +218,7 @@ impl WorkerConfig {
             max_consecutive_resource_failures,
             enrich_daily_limit,
             auto_enrich_enqueue,
+            related_harvest_max_depth,
         })
     }
 
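
Two details of this configuration are easy to miss: an unset or unparsable `RELATED_HARVEST_MAX_DEPTH` silently falls back to 2, and the cap combines with the `40-(depth+1)` priority rule from the commit message. The sketch below models both as pure functions. It is an illustration under assumptions: `parse_max_depth`, `plan_related_harvest`, and the constants are hypothetical names, and the exact comparison the worker uses against the cap is inferred from the doc comment ("up to this cap") rather than taken from the code.

```rust
/// Default cap, matching the `unwrap_or(2)` in the diff.
const DEFAULT_MAX_DEPTH: u32 = 2;
/// Base of the `40-(depth+1)` priority rule from the commit message.
const BASE_PRIORITY: i64 = 40;

/// Mirrors the env parsing: a missing or invalid value yields the default.
fn parse_max_depth(raw: Option<&str>) -> u32 {
    raw.and_then(|v| v.parse().ok()).unwrap_or(DEFAULT_MAX_DEPTH)
}

/// For an enrichment at `enrich_depth`, returns the depth and priority of the
/// follow-up harvest, or `None` when the child would exceed the cap.
fn plan_related_harvest(enrich_depth: u32, max_depth: u32) -> Option<(u32, i64)> {
    // A related harvest sits one level below the enrichment that found it.
    let child_depth = enrich_depth.checked_add(1)?;
    if child_depth > max_depth {
        return None;
    }
    // Deeper harvests get a lower number, so they yield to shallower ones.
    Some((child_depth, BASE_PRIORITY - i64::from(child_depth)))
}
```

With the default cap, a root enrichment (depth 0) plans a harvest at depth 1 with priority 39, a depth-1 enrichment plans depth 2 with priority 38, and a depth-2 enrichment plans nothing. Note that the silent fallback means a typo such as `RELATED_HARVEST_MAX_DEPTH=two` behaves like the default rather than failing at startup.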
