feat(postcodes/NL): 4,072 Dutch PC4 postcodes (#1039) #1518
Merged
Adds the Netherlands' 4-digit postcode districts (PC4) aggregated from the mevdschee/postcodes-nl LGPL-3 mirror of Dutch BAG / Kadaster public address data.

Why
---
Closes the NL gap on issue #1039. The full PC6 (4-digit + 2-letter) list has 467,109 unique codes, which would generate ~70 MB JSON, exceeding the in-band cities/*.json size envelope (PT.json at 38 MB is current largest). PC4 is the standard Dutch district-level granularity (~4,000 districts), comparable to UK postcode areas and Canada FSAs already shipped at this scale.

Coverage
--------
- 4,072 PC4 records / country-only state FK
- Each row carries the most-common woonplaats (settlement) per PC4 as the representative locality_name

State FK strategy
-----------------
Country-only ship. The Netherlands' 12 provinces span PC4 ranges with significant overlap (e.g. PC4 1xxx covers Noord-Holland and Flevoland), so 1:1 PC4 -> province mapping would be misleading. Matches the SE / SI / GB precedent for sources without clean state hierarchy.

Source pipeline
---------------
1. Resolve latest release URL via GitHub API
2. Fetch 17 MB 7zip archive
3. Extract via py7zr (pure-Python) -> 401 MB CSV
4. Stream-aggregate 9M+ street rows to (PC4, woonplaats) counts
5. Pick most-common woonplaats per PC4

Regex fix
---------
Before this PR, NL regex was `^\d{4}\s?[a-zA-Z]{2}$` (PC6 only). Updated to `^\d{4}(?:\s?[A-Za-z]{2})?$` to accept PC4 also, matching the mixed-granularity pattern already permitted for GB / TW / CA / IR.

Dependency
----------
Adds runtime dependency on `py7zr` (LGPL, pure-Python 7zip reader). Documented in importer docstring.

License
-------
mevdschee/postcodes-nl: LGPL-3.0. Upstream: Dutch Kadaster / BAG (Basisregistratie Adressen en Gebouwen) public open data.
Each row: source: "bag-via-mevdschee"

Validation
----------
- python3 -m py_compile passes
- 100% regex match against updated NL regex
- Country-only ship (no state_id), follows SE/SI/GB pattern
- No auto-managed fields (id, created_at, updated_at, flag)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
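Steps 4 and 5 of the source pipeline (stream-aggregate to (PC4, woonplaats) counts, then pick the most common woonplaats per PC4) could look roughly like the sketch below. This is not the importer's actual code: the function name and the column names `postcode` and `woonplaats` are assumptions for illustration, and the real CSV header may differ. The demo rows are placeholder values, not data from the source.

```python
import csv
import io
from collections import Counter, defaultdict


def most_common_locality_per_pc4(rows, postcode_field="postcode",
                                 locality_field="woonplaats"):
    """Return {pc4: most frequent locality} from an iterable of dict rows.

    The field names are hypothetical; adjust them to the real CSV header.
    Rows are consumed one at a time, so a large CSV is never held in memory.
    """
    counts = defaultdict(Counter)
    for row in rows:
        # PC4 is the leading four digits of a PC6 code such as "1000AA".
        pc4 = row[postcode_field].strip()[:4]
        if len(pc4) == 4 and pc4.isdigit():
            counts[pc4][row[locality_field].strip()] += 1
    # On a tie, Counter.most_common keeps the locality seen first.
    return {pc4: counts[pc4].most_common(1)[0][0] for pc4 in sorted(counts)}


# Placeholder rows standing in for the extracted CSV.
sample = io.StringIO(
    "postcode,woonplaats\n"
    "1000AA,Alpha\n"
    "1000AB,Alpha\n"
    "1000AC,Beta\n"
    "2000AA,Gamma\n"
)
result = most_common_locality_per_pc4(csv.DictReader(sample))
print(result)  # {'1000': 'Alpha', '2000': 'Gamma'}
```

Passing a `csv.DictReader` opened on the extracted file instead of the in-memory sample keeps the same streaming behaviour.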
Summary
Source
mevdschee/postcodes-nl: LGPL-3 mirror, 17 MB 7zip / 401 MB CSV (v26.03 release)
Why PC4 (not PC6)
Unique PC6 codes total 467,109. JSON-export at PC6 level would produce ~70 MB, exceeding the in-band cities/*.json envelope (PT.json at 38 MB is current largest). PC4 is the standard Dutch district-level granularity (~4,000 districts), comparable to UK postcode areas and Canada FSAs.
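As a rough back-of-envelope check of the size argument, the figures quoted above imply about 150 bytes of JSON per record. The sketch below assumes 1 MB = 10^6 bytes and that a PC4 record is about the same size as a PC6 record; both are assumptions, and the PC4 figure is an estimate rather than a measured file size.

```python
# Figures quoted in this PR description.
PC6_COUNT = 467_109
PC6_JSON_BYTES = 70_000_000  # "~70 MB", taking 1 MB = 10**6 bytes (assumption)
PC4_COUNT = 4_072

# Implied average record size, assuming the ~70 MB estimate is accurate.
bytes_per_record = PC6_JSON_BYTES / PC6_COUNT

# Assumption: a PC4 record serialises to roughly the same size as a PC6 one.
pc4_estimate_bytes = bytes_per_record * PC4_COUNT

print(f"~{bytes_per_record:.0f} bytes per record")
print(f"PC4 export estimate: ~{pc4_estimate_bytes / 1e6:.1f} MB")
```

Under those assumptions the PC4 export lands well below 1 MB, far under the 38 MB PT.json ceiling.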
Why country-only state FK
Dutch PC4 ranges span multiple provinces with overlap (e.g. PC4 1xxx covers Noord-Holland and Flevoland). 1:1 mapping would be misleading. Matches the SE / SI / GB precedent.
Pipeline
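1. Resolve latest release URL via GitHub API
2. Fetch 17 MB 7zip archive
3. Extract via `py7zr` → 401 MB CSV
4. Stream-aggregate 9M+ street rows to (PC4, woonplaats) counts
5. Pick most-common woonplaats per PC4
Regex fix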
Old: `^\d{4}\s?[a-zA-Z]{2}$` (PC6 only)
New: `^\d{4}(?:\s?[A-Za-z]{2})?$` (accepts PC4, plus PC6 with or without a space)
Matches the mixed-granularity pattern already permitted for GB / TW / CA / IR.
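To illustrate the difference between the two patterns, here is a small check using Python's `re` module. The helper name `accepts` and the sample inputs are made up for this example; the two regexes are copied from the PR text.

```python
import re

# Patterns as quoted in this PR.
OLD_NL = re.compile(r"^\d{4}\s?[a-zA-Z]{2}$")        # PC6 only
NEW_NL = re.compile(r"^\d{4}(?:\s?[A-Za-z]{2})?$")   # PC4 or PC6


def accepts(pattern, code):
    """Return True if the compiled pattern matches the whole code."""
    return pattern.match(code) is not None


# Illustrative inputs, not rows from the imported dataset.
samples = ["1012", "1012AB", "1012 AB", "1012 ab", "101", "1012A", "1012 "]

for code in samples:
    print(f"{code!r:10} old={accepts(OLD_NL, code)!s:5} new={accepts(NEW_NL, code)}")
```

Only the bare four-digit input changes outcome between the two patterns: the old regex rejects it and the new one accepts it. Malformed inputs such as three digits, a single trailing letter, or a dangling space are rejected by both.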
Dependency
Adds runtime dependency on `py7zr` (LGPL, pure-Python 7zip reader). Install via `python3 -m pip install py7zr`.
Sample rows
Test plan
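- `python3 -m py_compile bin/scripts/sync/import_netherlands_postcodes.py` passes
- 100% regex match against the updated NL regex `^\d{4}(?:\s?[A-Za-z]{2})?$`
- Country-only ship (no `state_id`); follows SE/SI/GB pattern
- No auto-managed fields (`id`, `created_at`, `updated_at`, `flag`)

🤖 Generated with Claude Code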