refactor: consume regelrecht-law-model in pipeline and harvester by tdjager · Pull Request #858 · MinBZK/regelrecht

tdjager · 2026-06-25T08:45:52Z

What & why

Advies 11 from the codebase audit: the canonical law-YAML model lives in the leaf crate regelrecht-law-model (already consumed by engine, corpus, tui). The two remaining consumers kept a partial mirror, so a schema change had to be hand-synced in several places. This routes the executable model through the single source while respecting the formatting/projection code that a naive parse→serialize roundtrip would break.

pipeline — type the read/count side

Before: count_article_stats walked the law as untyped serde_yaml_ng::Value — map.get("articles"), article_map.contains_key("machine_readable") (stringly-typed field access).

After: it parses into regelrecht_law_model::ArticleBasedLaw and counts articles.len() / articles.iter().filter(|a| a.machine_readable.is_some()).

The surgical write path is intentionally left untouched: production enrichment is done by an external LLM subprocess that edits the file directly, so there is no Rust-side model write to unify.
Behavior change: a structurally-invalid law now surfaces as a parse error in the count instead of being silently undercounted — desirable, as this only runs on real harvested/enriched corpus files. Test fixtures updated to realistic minimal laws + an empty-machine_readable edge case.

harvester — drop a duplicate, keep the write-projection

The writer is a deliberate write-projection: it emits preamble, organisation, cvdr_id and per-article references (which the executable ArticleBasedLaw does not carry) with bespoke quoting/indent post-processing, so it is not routed through ArticleBasedLaw (that would change byte output and lose data).

Before: an internal YamlReference struct duplicated types::Reference field-for-field. After: it is removed and types::Reference is serialized directly (byte-identical — same fields, order, skip_serializing_if).
Added regelrecht-law-model as a dev-dependency and two regression tests the repo lacked: golden byte-identity for a BWB and a CVDR law (previously only substrings were asserted), and a conformance test that the harvested output deserializes cleanly into ArticleBasedLaw.

Out of scope

No new fields / reordering / skip_serializing_if churn on ArticleBasedLaw (it is deserialize-only today and harvested output must stay byte-identical). harvester/src/types.rs Law/Article/Preamble/Reference stay — they model the scraped source document, not the YAML document.

Verification

cargo fmt --check, clippy --all-features -Dwarnings, cargo check, and the harvester + pipeline test suites are green. The new harvester golden test is the byte-identity regression proof.

claude

Harvester en pipeline

Correctheid

De kernwijziging in pipeline/src/enrich.rs is correct. ArticleBasedLaw heeft machine_readable: Option<MachineReadable> en valid_from: Option<String>, zodat de typed parse correct werkt voor alle bestaande fixtures.

De byte-identiteitsbelofte van YamlReference → Reference is geverifieerd: beide structs hebben exact dezelfde velden, volgorde en skip_serializing_if-annotaties (packages/harvester/src/types.rs:197). De gouden tests leggen dit nu formeel vast.

Gedragswijziging `machine_readable: null`

🟡 Minor — De overstap van contains_key("machine_readable") naar article.machine_readable.is_some() verandert de semantiek voor machine_readable: null:

Oud: null heeft een key → telt als verrijkt
Nieuw: null deserialiseert als None → telt als onverrijkt

De PR documenteert dit expliciet en stelt dat geen enkel corpusbestand de null-vorm gebruikt. Dat is aannemelijk, maar als een LLM ooit machine_readable: null emitteert in plaats van de key weg te laten, daalt de verrijkingsteller stil. De bestaande test dekt de nieuwe semantiek, maar een expliciete bewaking op corpusniveau ontbreekt. Geen blocker, maar het is het waard om in het oog te houden bij toekomstige LLM-uitvoerchecks.

Overige observaties

De conformancetest (test_generated_yaml_conforms_to_law_model) is een mooie toevoeging: hij bewijst dat de harvester-schrijver trouw blijft aan het canonieke model zonder dat ze structureel samengevoegd hoeven te worden.

Geen kritische of significante bevindingen. De refactor verwijdert terecht stringly-typed YAML-traversal en elimineert een duplicaatstructuur die handmatig gesynchroniseerd moest worden.

claude

Harvester en pipeline

Correctheid

De kernwijziging in pipeline/src/enrich.rs is correct. ArticleBasedLaw heeft machine_readable: Option<MachineReadable> en valid_from: Option<String>, zodat de typed parse correct werkt voor alle bestaande fixtures.

De byte-identiteitsbelofte van YamlReference → Reference is geverifieerd: beide structs hebben exact dezelfde velden, volgorde en skip_serializing_if-annotaties (packages/harvester/src/types.rs:197). De gouden tests leggen dit nu formeel vast.

Gedragswijziging machine_readable: null

🟡 Minor — De overstap van contains_key("machine_readable") naar article.machine_readable.is_some() verandert de semantiek voor machine_readable: null:

Oud: null heeft een key, telt als verrijkt
Nieuw: null deserialiseert als None, telt als onverrijkt

De PR documenteert dit expliciet en stelt dat geen enkel corpusbestand de null-vorm gebruikt. Dat is aannemelijk, maar als een LLM ooit machine_readable: null emitteert in plaats van de key weg te laten, daalt de verrijkingsteller stil. De bestaande test dekt de nieuwe semantiek, maar een expliciete bewaking op corpusniveau ontbreekt. Geen blocker, maar het is het waard om in het oog te houden bij toekomstige LLM-uitvoerchecks.

Overige observaties

De conformancetest (test_generated_yaml_conforms_to_law_model) is een mooie toevoeging: hij bewijst dat de harvester-schrijver trouw blijft aan het canonieke model zonder dat ze structureel samengevoegd hoeven te worden.

Geen kritische of significante bevindingen. De refactor verwijdert terecht stringly-typed YAML-traversal en elimineert een duplicaatstructuur die handmatig gesynchroniseerd moest worden.

claude

Harvester en pipeline

Correctheid

De kernwijziging in pipeline/src/enrich.rs is correct. ArticleBasedLaw heeft machine_readable: Option<MachineReadable> en valid_from: Option<String>, zodat de typed parse correct werkt voor alle bestaande fixtures.

De byte-identiteitsbelofte van YamlReference → Reference is geverifieerd: beide structs hebben exact dezelfde velden, volgorde en skip_serializing_if-annotaties (packages/harvester/src/types.rs:197). De gouden tests leggen dit nu formeel vast.

Gedragswijziging `machine_readable: null`

🟡 Minor — De overstap van contains_key("machine_readable") naar article.machine_readable.is_some() verandert de semantiek voor machine_readable: null:

Oud: null heeft een key → telt als verrijkt
Nieuw: null deserialiseert als None → telt als onverrijkt

De PR documenteert dit expliciet en stelt dat geen enkel corpusbestand de null-vorm gebruikt. Dat is aannemelijk, maar als een LLM ooit machine_readable: null emitteert in plaats van de key weg te laten, daalt de verrijkingsteller stil. De bestaande test dekt de nieuwe semantiek, maar een expliciete bewaking op corpusniveau ontbreekt. Geen blocker, maar het is het waard om in het oog te houden bij toekomstige LLM-uitvoerchecks.

Overige observaties

De conformancetest (test_generated_yaml_conforms_to_law_model) is een mooie toevoeging: hij bewijst dat de harvester-schrijver trouw blijft aan het canonieke model zonder dat ze structureel samengevoegd hoeven te worden.

Geen kritische of significante bevindingen. De refactor verwijdert terecht stringly-typed YAML-traversal en elimineert een duplicaatstructuur die handmatig gesynchroniseerd moest worden.

tdjager · 2026-06-25T10:34:49Z

Thanks for the review. Re: the machine_readable: null semantics note — the behavior ({} counts as enriched, null does not) is documented on count_article_stats and pinned by test_count_article_stats_empty_vs_null_machine_readable (commit 3e6d600, now rebased into 4232d37), and no corpus file uses the bare/null form today.

The suggested corpus-level guard against a hypothetical future machine_readable: null is intentionally left out of scope here: this PR is a model-dedup refactor, and a runtime/CI guard on LLM enrichment output belongs in the enrich-output validation path (or schema), not in this change. Noting it as a sensible follow-up rather than folding it in. No code change for this finding.

…eline and harvester Advies 11: the two remaining consumers that kept a partial mirror of the canonical law-YAML model now source it from the leaf crate regelrecht-law-model, so the executable model has a single definition. pipeline: - Replace the untyped serde_yaml_ng::Value treewalk in count_article_stats (map.get("articles") / contains_key("machine_readable")) with a typed parse into ArticleBasedLaw. The read/count side is now type-checked against the single source of truth instead of stringly-typed field access. - The surgical write path is intentionally left untouched: production enrichment is performed by an external LLM subprocess that edits the file directly, so there is no Rust-side model write to unify. - Behavior change: a structurally-invalid law now surfaces as a parse error in the count rather than being silently undercounted — desirable, since this only runs on real harvested/enriched corpus files. Test fixtures updated to realistic minimal laws; an empty-machine_readable edge case is added. harvester: - The writer stays a deliberate write-projection (it carries preamble, organisation, cvdr_id and per-article references that the executable ArticleBasedLaw does not, with bespoke quoting/indent post-processing), but the internal YamlReference struct was a byte-identical duplicate of types::Reference, so it is dropped and types::Reference is serialized directly. - Add regelrecht-law-model as a dev-dependency and two regression tests: golden byte-identity for a BWB and a CVDR law (the repo previously only asserted substrings), and a conformance test proving the harvested output deserializes cleanly into ArticleBasedLaw.

Document and assert the one parity nuance between the typed count and the old contains_key walk: an empty `machine_readable: {}` counts as enriched (an LLM may insert a bare section), while an explicit `machine_readable: null` is treated as un-enriched. No corpus file uses the bare/null form, so this matches the previous behavior in practice. Addresses a ralph-review completeness finding.

claude

Review

Clean structural refactor — no critical or significant issues.

Harvester / pipeline — Harvester and pipeline

The YamlReference removal is byte-identical by construction (same field set, field order, and skip_serializing_if annotations as types::Reference); the golden tests pin this claim. The .clone() in the writer is the right call — no allocation overhead worth noting at parse time.

count_article_stats switching to ArticleBasedLaw is a strict improvement: the old contains_key("machine_readable") walk was silently tolerant of any structure, the new parse fails loudly on malformed corpus files, which is the right behavior in the enrichment path.

Minor

🟡 test_count_article_stats_empty_vs_null_machine_readable omits valid_from, while every other fixture in this PR includes it. If ArticleBasedLaw ever makes valid_from required for deserialization, this test breaks with a confusing error rather than an assertion failure. Trivial to add for consistency:

$id: test_law
regulatory_layer: WET
publication_date: '2025-01-01'
+valid_from: '2025-01-01'
articles:

Everything else looks good. Dependency classification is correct (regelrecht-law-model as [dependencies] in pipeline, [dev-dependencies] in harvester). The valid_from note is non-blocking.

claude · 2026-06-25T15:25:12Z

+publication_date: '2025-01-01'
+valid_from: '2025-01-01'
 articles:
-  - id: art1


🟡 valid_from is present in every other new test fixture in this PR but missing here. If ArticleBasedLaw ever requires it for deserialization, this test will fail with a parse error rather than a useful assertion failure.

$id: test_law regulatory_layer: WET publication_date: '2025-01-01' +valid_from: '2025-01-01' articles:

github-actions · 2026-06-25T15:30:33Z

Preview Deployment — enrichworker

Your changes have been deployed to a preview environment:

URL: https://enrichworker-pr858-regel-k4c.rig.prd1.gn2.quattro.rijksapps.nl

This deployment will be automatically cleaned up when the PR is closed.

github-actions · 2026-06-25T15:32:28Z

Preview Deployment — editor — editor

Your changes have been deployed to a preview environment:

URL: https://editor-pr858-regel-k4c.rig.prd1.gn2.quattro.rijksapps.nl

This deployment will be automatically cleaned up when the PR is closed.

tdjager marked this pull request as ready for review June 25, 2026 10:02

tdjager force-pushed the refactor/law-model-consumers branch from 3e6d600 to 4232d37 Compare June 25, 2026 10:08

claude Bot reviewed Jun 25, 2026

View reviewed changes

tdjager added 2 commits June 25, 2026 15:22

tdjager force-pushed the refactor/law-model-consumers branch from 4232d37 to 260fb89 Compare June 25, 2026 15:22

tdjager enabled auto-merge (squash) June 25, 2026 15:23

claude Bot reviewed Jun 25, 2026

View reviewed changes

tdjager merged commit 3253ba0 into main Jun 25, 2026
33 checks passed

tdjager deleted the refactor/law-model-consumers branch June 25, 2026 15:26

tdjager deployed to pr858 June 25, 2026 15:32 — with GitHub Actions View deployment

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

refactor: consume regelrecht-law-model in pipeline and harvester#858

refactor: consume regelrecht-law-model in pipeline and harvester#858
tdjager merged 2 commits into
mainfrom
refactor/law-model-consumers

tdjager commented Jun 25, 2026

Uh oh!

claude Bot left a comment

Uh oh!

claude Bot left a comment

Uh oh!

claude Bot left a comment

Uh oh!

tdjager commented Jun 25, 2026

Uh oh!

claude Bot left a comment

Uh oh!

claude Bot Jun 25, 2026

Uh oh!

Uh oh!

github-actions Bot commented Jun 25, 2026

Uh oh!

github-actions Bot commented Jun 25, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

tdjager commented Jun 25, 2026

What & why

pipeline — type the read/count side

harvester — drop a duplicate, keep the write-projection

Out of scope

Verification

Uh oh!

claude Bot left a comment

Choose a reason for hiding this comment

Harvester en pipeline

Correctheid

Gedragswijziging machine_readable: null

Overige observaties

Uh oh!

claude Bot left a comment

Choose a reason for hiding this comment

Harvester en pipeline

Correctheid

Gedragswijziging machine_readable: null

Overige observaties

Uh oh!

claude Bot left a comment

Choose a reason for hiding this comment

Harvester en pipeline

Correctheid

Gedragswijziging machine_readable: null

Overige observaties

Uh oh!

tdjager commented Jun 25, 2026

Uh oh!

claude Bot left a comment

Choose a reason for hiding this comment

Review

Harvester / pipeline — Harvester and pipeline

Minor

Uh oh!

claude Bot Jun 25, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

github-actions Bot commented Jun 25, 2026

Preview Deployment — enrichworker

Uh oh!

github-actions Bot commented Jun 25, 2026

Preview Deployment — editor — editor

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Gedragswijziging `machine_readable: null`

Gedragswijziging `machine_readable: null`