Skip to content

refactor: consume regelrecht-law-model in pipeline and harvester#858

Merged
tdjager merged 2 commits into
mainfrom
refactor/law-model-consumers
Jun 25, 2026
Merged

refactor: consume regelrecht-law-model in pipeline and harvester#858
tdjager merged 2 commits into
mainfrom
refactor/law-model-consumers

Conversation

@tdjager

@tdjager tdjager commented Jun 25, 2026

Copy link
Copy Markdown
Collaborator

What & why

Advies 11 from the codebase audit: the canonical law-YAML model lives in the leaf crate regelrecht-law-model (already consumed by engine, corpus, tui). The two remaining consumers kept a partial mirror, so a schema change had to be hand-synced in several places. This routes the executable model through the single source while respecting the formatting/projection code that a naive parse→serialize roundtrip would break.

pipeline — type the read/count side

Before: count_article_stats walked the law as untyped serde_yaml_ng::Valuemap.get("articles"), article_map.contains_key("machine_readable") (stringly-typed field access).

After: it parses into regelrecht_law_model::ArticleBasedLaw and counts articles.len() / articles.iter().filter(|a| a.machine_readable.is_some()).

  • The surgical write path is intentionally left untouched: production enrichment is done by an external LLM subprocess that edits the file directly, so there is no Rust-side model write to unify.
  • Behavior change: a structurally-invalid law now surfaces as a parse error in the count instead of being silently undercounted — desirable, as this only runs on real harvested/enriched corpus files. Test fixtures updated to realistic minimal laws + an empty-machine_readable edge case.

harvester — drop a duplicate, keep the write-projection

The writer is a deliberate write-projection: it emits preamble, organisation, cvdr_id and per-article references (which the executable ArticleBasedLaw does not carry) with bespoke quoting/indent post-processing, so it is not routed through ArticleBasedLaw (that would change byte output and lose data).

  • Before: an internal YamlReference struct duplicated types::Reference field-for-field. After: it is removed and types::Reference is serialized directly (byte-identical — same fields, order, skip_serializing_if).
  • Added regelrecht-law-model as a dev-dependency and two regression tests the repo lacked: golden byte-identity for a BWB and a CVDR law (previously only substrings were asserted), and a conformance test that the harvested output deserializes cleanly into ArticleBasedLaw.

Out of scope

No new fields / reordering / skip_serializing_if churn on ArticleBasedLaw (it is deserialize-only today and harvested output must stay byte-identical). harvester/src/types.rs Law/Article/Preamble/Reference stay — they model the scraped source document, not the YAML document.

Verification

cargo fmt --check, clippy --all-features -Dwarnings, cargo check, and the harvester + pipeline test suites are green. The new harvester golden test is the byte-identity regression proof.

@tdjager tdjager marked this pull request as ready for review June 25, 2026 10:02
@tdjager tdjager force-pushed the refactor/law-model-consumers branch from 3e6d600 to 4232d37 Compare June 25, 2026 10:08

@claude claude Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Harvester en pipeline

Correctheid

De kernwijziging in pipeline/src/enrich.rs is correct. ArticleBasedLaw heeft machine_readable: Option<MachineReadable> en valid_from: Option<String>, zodat de typed parse correct werkt voor alle bestaande fixtures.

De byte-identiteitsbelofte van YamlReferenceReference is geverifieerd: beide structs hebben exact dezelfde velden, volgorde en skip_serializing_if-annotaties (packages/harvester/src/types.rs:197). De gouden tests leggen dit nu formeel vast.

Gedragswijziging machine_readable: null

🟡 Minor — De overstap van contains_key("machine_readable") naar article.machine_readable.is_some() verandert de semantiek voor machine_readable: null:

  • Oud: null heeft een key → telt als verrijkt
  • Nieuw: null deserialiseert als None → telt als onverrijkt

De PR documenteert dit expliciet en stelt dat geen enkel corpusbestand de null-vorm gebruikt. Dat is aannemelijk, maar als een LLM ooit machine_readable: null emitteert in plaats van de key weg te laten, daalt de verrijkingsteller stil. De bestaande test dekt de nieuwe semantiek, maar een expliciete bewaking op corpusniveau ontbreekt. Geen blocker, maar het is het waard om in het oog te houden bij toekomstige LLM-uitvoerchecks.

Overige observaties

De conformancetest (test_generated_yaml_conforms_to_law_model) is een mooie toevoeging: hij bewijst dat de harvester-schrijver trouw blijft aan het canonieke model zonder dat ze structureel samengevoegd hoeven te worden.

Geen kritische of significante bevindingen. De refactor verwijdert terecht stringly-typed YAML-traversal en elimineert een duplicaatstructuur die handmatig gesynchroniseerd moest worden.

@claude claude Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Harvester en pipeline

Correctheid

De kernwijziging in pipeline/src/enrich.rs is correct. ArticleBasedLaw heeft machine_readable: Option<MachineReadable> en valid_from: Option<String>, zodat de typed parse correct werkt voor alle bestaande fixtures.

De byte-identiteitsbelofte van YamlReferenceReference is geverifieerd: beide structs hebben exact dezelfde velden, volgorde en skip_serializing_if-annotaties (packages/harvester/src/types.rs:197). De gouden tests leggen dit nu formeel vast.

Gedragswijziging machine_readable: null

🟡 Minor — De overstap van contains_key("machine_readable") naar article.machine_readable.is_some() verandert de semantiek voor machine_readable: null:

  • Oud: null heeft een key, telt als verrijkt
  • Nieuw: null deserialiseert als None, telt als onverrijkt

De PR documenteert dit expliciet en stelt dat geen enkel corpusbestand de null-vorm gebruikt. Dat is aannemelijk, maar als een LLM ooit machine_readable: null emitteert in plaats van de key weg te laten, daalt de verrijkingsteller stil. De bestaande test dekt de nieuwe semantiek, maar een expliciete bewaking op corpusniveau ontbreekt. Geen blocker, maar het is het waard om in het oog te houden bij toekomstige LLM-uitvoerchecks.

Overige observaties

De conformancetest (test_generated_yaml_conforms_to_law_model) is een mooie toevoeging: hij bewijst dat de harvester-schrijver trouw blijft aan het canonieke model zonder dat ze structureel samengevoegd hoeven te worden.

Geen kritische of significante bevindingen. De refactor verwijdert terecht stringly-typed YAML-traversal en elimineert een duplicaatstructuur die handmatig gesynchroniseerd moest worden.

@claude claude Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Harvester en pipeline

Correctheid

De kernwijziging in pipeline/src/enrich.rs is correct. ArticleBasedLaw heeft machine_readable: Option<MachineReadable> en valid_from: Option<String>, zodat de typed parse correct werkt voor alle bestaande fixtures.

De byte-identiteitsbelofte van YamlReferenceReference is geverifieerd: beide structs hebben exact dezelfde velden, volgorde en skip_serializing_if-annotaties (packages/harvester/src/types.rs:197). De gouden tests leggen dit nu formeel vast.

Gedragswijziging machine_readable: null

🟡 Minor — De overstap van contains_key("machine_readable") naar article.machine_readable.is_some() verandert de semantiek voor machine_readable: null:

  • Oud: null heeft een key → telt als verrijkt
  • Nieuw: null deserialiseert als None → telt als onverrijkt

De PR documenteert dit expliciet en stelt dat geen enkel corpusbestand de null-vorm gebruikt. Dat is aannemelijk, maar als een LLM ooit machine_readable: null emitteert in plaats van de key weg te laten, daalt de verrijkingsteller stil. De bestaande test dekt de nieuwe semantiek, maar een expliciete bewaking op corpusniveau ontbreekt. Geen blocker, maar het is het waard om in het oog te houden bij toekomstige LLM-uitvoerchecks.

Overige observaties

De conformancetest (test_generated_yaml_conforms_to_law_model) is een mooie toevoeging: hij bewijst dat de harvester-schrijver trouw blijft aan het canonieke model zonder dat ze structureel samengevoegd hoeven te worden.

Geen kritische of significante bevindingen. De refactor verwijdert terecht stringly-typed YAML-traversal en elimineert een duplicaatstructuur die handmatig gesynchroniseerd moest worden.

@tdjager

tdjager commented Jun 25, 2026

Copy link
Copy Markdown
Collaborator Author

Thanks for the review. Re: the machine_readable: null semantics note — the behavior ({} counts as enriched, null does not) is documented on count_article_stats and pinned by test_count_article_stats_empty_vs_null_machine_readable (commit 3e6d600, now rebased into 4232d37), and no corpus file uses the bare/null form today.

The suggested corpus-level guard against a hypothetical future machine_readable: null is intentionally left out of scope here: this PR is a model-dedup refactor, and a runtime/CI guard on LLM enrichment output belongs in the enrich-output validation path (or schema), not in this change. Noting it as a sensible follow-up rather than folding it in. No code change for this finding.

tdjager added 2 commits June 25, 2026 15:22
…eline and harvester

Advies 11: the two remaining consumers that kept a partial mirror of the
canonical law-YAML model now source it from the leaf crate
regelrecht-law-model, so the executable model has a single definition.

pipeline:
- Replace the untyped serde_yaml_ng::Value treewalk in count_article_stats
  (map.get("articles") / contains_key("machine_readable")) with a typed parse
  into ArticleBasedLaw. The read/count side is now type-checked against the
  single source of truth instead of stringly-typed field access.
- The surgical write path is intentionally left untouched: production
  enrichment is performed by an external LLM subprocess that edits the file
  directly, so there is no Rust-side model write to unify.
- Behavior change: a structurally-invalid law now surfaces as a parse error in
  the count rather than being silently undercounted — desirable, since this
  only runs on real harvested/enriched corpus files. Test fixtures updated to
  realistic minimal laws; an empty-machine_readable edge case is added.

harvester:
- The writer stays a deliberate write-projection (it carries preamble,
  organisation, cvdr_id and per-article references that the executable
  ArticleBasedLaw does not, with bespoke quoting/indent post-processing), but
  the internal YamlReference struct was a byte-identical duplicate of
  types::Reference, so it is dropped and types::Reference is serialized
  directly.
- Add regelrecht-law-model as a dev-dependency and two regression tests:
  golden byte-identity for a BWB and a CVDR law (the repo previously only
  asserted substrings), and a conformance test proving the harvested output
  deserializes cleanly into ArticleBasedLaw.
Document and assert the one parity nuance between the typed count and the old
contains_key walk: an empty `machine_readable: {}` counts as enriched (an LLM may
insert a bare section), while an explicit `machine_readable: null` is treated as
un-enriched. No corpus file uses the bare/null form, so this matches the previous
behavior in practice. Addresses a ralph-review completeness finding.
@tdjager tdjager force-pushed the refactor/law-model-consumers branch from 4232d37 to 260fb89 Compare June 25, 2026 15:22
@tdjager tdjager enabled auto-merge (squash) June 25, 2026 15:23

@claude claude Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review

Clean structural refactor — no critical or significant issues.

Harvester / pipeline — Harvester and pipeline

The YamlReference removal is byte-identical by construction (same field set, field order, and skip_serializing_if annotations as types::Reference); the golden tests pin this claim. The .clone() in the writer is the right call — no allocation overhead worth noting at parse time.

count_article_stats switching to ArticleBasedLaw is a strict improvement: the old contains_key("machine_readable") walk was silently tolerant of any structure, the new parse fails loudly on malformed corpus files, which is the right behavior in the enrichment path.

Minor

🟡 test_count_article_stats_empty_vs_null_machine_readable omits valid_from, while every other fixture in this PR includes it. If ArticleBasedLaw ever makes valid_from required for deserialization, this test breaks with a confusing error rather than an assertion failure. Trivial to add for consistency:

$id: test_law
regulatory_layer: WET
publication_date: '2025-01-01'
+valid_from: '2025-01-01'
articles:

Everything else looks good. Dependency classification is correct (regelrecht-law-model as [dependencies] in pipeline, [dev-dependencies] in harvester). The valid_from note is non-blocking.

publication_date: '2025-01-01'
valid_from: '2025-01-01'
articles:
- id: art1

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 valid_from is present in every other new test fixture in this PR but missing here. If ArticleBasedLaw ever requires it for deserialization, this test will fail with a parse error rather than a useful assertion failure.

$id: test_law
regulatory_layer: WET
publication_date: '2025-01-01'
+valid_from: '2025-01-01'
articles:

@tdjager tdjager merged commit 3253ba0 into main Jun 25, 2026
33 checks passed
@tdjager tdjager deleted the refactor/law-model-consumers branch June 25, 2026 15:26
@github-actions

Copy link
Copy Markdown

Preview Deployment — enrichworker

Your changes have been deployed to a preview environment:

URL: https://enrichworker-pr858-regel-k4c.rig.prd1.gn2.quattro.rijksapps.nl

This deployment will be automatically cleaned up when the PR is closed.

@github-actions

Copy link
Copy Markdown

Preview Deployment — editor — editor

Your changes have been deployed to a preview environment:

URL: https://editor-pr858-regel-k4c.rig.prd1.gn2.quattro.rijksapps.nl

This deployment will be automatically cleaned up when the PR is closed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant