
feat(postcodes/NL): 4,072 Dutch PC4 postcodes (#1039) #1518

Merged
dr5hn merged 1 commit into master from feat/postcodes-netherlands
May 5, 2026

Conversation

@dr5hn dr5hn commented May 4, 2026

Summary

  • Imports the Netherlands' 4-digit postcode districts (PC4) for issue #1039 ("Can we add a postcode for this?")
  • 4,072 records, country-only state FK (matches SE/SI/GB precedent)
  • Aggregated from 9M+ street-level rows in mevdschee/postcodes-nl

Source

  • mevdschee/postcodes-nl — LGPL-3 mirror, 17 MB 7zip / 401 MB CSV (v26.03 release)
  • Upstream: Dutch Kadaster / BAG (Basisregistratie Adressen en Gebouwen) public open data

Why PC4 (not PC6)

There are 467,109 unique PC6 codes. A JSON export at PC6 level would be ~70 MB, exceeding the in-band cities/*.json envelope (PT.json, at 38 MB, is currently the largest). PC4 is the standard Dutch district-level granularity (~4,000 districts), comparable to UK postcode areas and Canada FSAs.
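The size argument can be checked with back-of-envelope arithmetic. The sketch below derives an average per-record size from the two PC6 figures quoted above and applies it to the PC4 count. It assumes decimal megabytes and that a PC4 record is about as large as a PC6 record, neither of which the PR states.

```python
# Figures taken from the PR description.
PC6_CODES = 467_109   # unique PC6 codes
PC6_JSON_MB = 70      # approximate size of a PC6-level JSON export
PC4_CODES = 4_072     # PC4 records shipped in this PR


def estimate_export_mb(records: int, bytes_per_record: float) -> float:
    """Estimate export size in decimal megabytes (assumption: 1 MB = 1,000,000 bytes)."""
    return records * bytes_per_record / 1_000_000


# Average bytes per record implied by the PC6 numbers (roughly 150 bytes).
bytes_per_record = PC6_JSON_MB * 1_000_000 / PC6_CODES

# Assumption: PC4 records have a similar shape, so the same average applies.
pc4_mb = estimate_export_mb(PC4_CODES, bytes_per_record)

print(f"~{bytes_per_record:.0f} bytes/record, PC4 export ~{pc4_mb:.2f} MB")
```

Under those assumptions the PC4 export comes out well under 1 MB, far below the 38 MB of the largest existing file.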

Why country-only state FK

Dutch PC4 ranges span multiple provinces with overlap (e.g. PC4 1xxx covers both Noord-Holland and Flevoland), so a 1:1 PC4-to-province mapping would be misleading. This matches the SE / SI / GB precedent.

Pipeline

  1. Resolve latest release URL via GitHub API
  2. Fetch 17 MB 7zip archive
  3. Extract via py7zr → 401 MB CSV
  4. Stream-aggregate 9M+ rows to (PC4, woonplaats) counts
  5. Pick most-common woonplaats per PC4 as representative locality
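Steps 4 and 5 can be sketched as follows. This is a minimal illustration, not the importer itself: the column names `postcode` and `woonplaats`, the helper names, and the sample rows are all assumptions made for the example, and the real CSV layout may differ.

```python
import csv
import io
from collections import Counter, defaultdict


def aggregate_pc4(rows):
    """Count woonplaats occurrences per PC4, one input row at a time.

    `rows` is any iterable of dicts with 'postcode' and 'woonplaats' keys
    (hypothetical column names), so a csv.DictReader can be streamed in
    without loading the whole file into memory.
    """
    counts = defaultdict(Counter)
    for row in rows:
        # Drop the optional space in "1011 AB", then keep the 4-digit prefix.
        pc4 = row["postcode"].replace(" ", "")[:4]
        if len(pc4) == 4 and pc4.isascii() and pc4.isdigit():
            counts[pc4][row["woonplaats"]] += 1
    return counts


def pick_representative(counts):
    """Map each PC4 to its most frequent woonplaats (ties go to the first seen)."""
    return {pc4: c.most_common(1)[0][0] for pc4, c in sorted(counts.items())}


# Made-up sample rows for illustration only.
sample_csv = io.StringIO(
    "postcode,woonplaats\n"
    "1011AB,Amsterdam\n"
    "1011 AC,Amsterdam\n"
    "1011AD,Diemen\n"
    "9999XA,Stitswerd\n"
)
result = pick_representative(aggregate_pc4(csv.DictReader(sample_csv)))
print(result)
```

With the sample rows above this prints `{'1011': 'Amsterdam', '9999': 'Stitswerd'}`. Only one small `Counter` per PC4 is held in memory, which is what makes a 401 MB input manageable.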

Regex fix

Old: ^\d{4}\s?[a-zA-Z]{2}$ (PC6 only)
New: ^\d{4}(?:\s?[A-Za-z]{2})?$ — accepts PC4 + PC6 with/without space

Matches the mixed-granularity pattern already permitted for GB / TW / CA / IR.
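A quick way to see the difference between the two patterns is to run both against a handful of inputs. The sample strings below are chosen for illustration; the two patterns are copied from the lines above.

```python
import re

# Patterns copied from the PR description.
OLD_NL = re.compile(r"^\d{4}\s?[a-zA-Z]{2}$")        # PC6 only
NEW_NL = re.compile(r"^\d{4}(?:\s?[A-Za-z]{2})?$")   # PC4 or PC6

# Illustrative inputs: PC4, PC6 without and with a space, and malformed codes.
samples = ["1011", "1011AB", "1011 AB", "101", "1011A", "1011  AB"]

for s in samples:
    old_ok = OLD_NL.match(s) is not None
    new_ok = NEW_NL.match(s) is not None
    print(f"{s!r:12} old={old_ok!s:5} new={new_ok}")
```

The only behavioural change is that a bare 4-digit code now matches. PC6 inputs are accepted exactly as before, and malformed inputs such as a single trailing letter or a double space are still rejected by both patterns.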

Dependency

Adds runtime dependency on py7zr (LGPL, pure-Python 7zip reader). Install via python3 -m pip install py7zr.

Sample rows

code  locality
1011  Amsterdam
1012  Amsterdam
9997  Zandeweer
9999  Stitswerd

Test plan

  • python3 -m py_compile bin/scripts/sync/import_netherlands_postcodes.py
  • All 4,072 codes match ^\d{4}(?:\s?[A-Za-z]{2})?$
  • Country-only (no state_id); follows SE/SI/GB pattern
  • No auto-managed fields (id, created_at, updated_at, flag)
  • Idempotent merge (re-run produces no diff)
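The idempotency item can be illustrated with a small merge keyed on the postcode. This is a sketch under assumptions, not the repository's merge logic: the function names are hypothetical, records are assumed to be keyed by `code` alone, and the field names are taken from the sample rows and commit message.

```python
import json


def merge_postcodes(existing, incoming):
    """Merge incoming records into existing ones, keyed by 'code' (assumed key).

    Incoming values overwrite existing ones field by field, and the result is
    sorted by code so the output order does not depend on the input order.
    """
    by_code = {rec["code"]: dict(rec) for rec in existing}
    for rec in incoming:
        by_code[rec["code"]] = {**by_code.get(rec["code"], {}), **rec}
    return [by_code[code] for code in sorted(by_code)]


def serialize(records):
    """Deterministic JSON: stable key order and indentation give stable diffs."""
    return json.dumps(records, ensure_ascii=False, indent=2, sort_keys=True)


incoming = [
    {"code": "1012", "locality_name": "Amsterdam", "source": "bag-via-mevdschee"},
    {"code": "1011", "locality_name": "Amsterdam", "source": "bag-via-mevdschee"},
]

first = merge_postcodes([], incoming)
second = merge_postcodes(first, incoming)  # re-run on the merged output

# Idempotent: the second run produces byte-identical output, i.e. no diff.
print(serialize(first) == serialize(second))
```

Sorting by key and serializing with a fixed key order are what keep a re-run from producing a spurious diff, regardless of the order in which the source rows arrive.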

🤖 Generated with Claude Code

Adds the Netherlands' 4-digit postcode districts (PC4) aggregated
from the mevdschee/postcodes-nl LGPL-3 mirror of Dutch BAG / Kadaster
public address data.

Why
---
Closes the NL gap on issue #1039. The full PC6 (4-digit + 2-letter)
list has 467,109 unique codes — would generate ~70 MB JSON,
exceeding the in-band cities/*.json size envelope (PT.json at 38 MB
is current largest).

PC4 is the standard Dutch district-level granularity (~4,000
districts), comparable to UK postcode areas and Canada FSAs already
shipped at this scale.

Coverage
--------
- 4,072 PC4 records / country-only state FK
- Each row carries the most-common woonplaats (settlement) per PC4
  as the representative locality_name

State FK strategy
-----------------
Country-only ship. The Netherlands' 12 provinces span PC4 ranges
with significant overlap (e.g. PC4 1xxx covers Noord-Holland and
Flevoland), so 1:1 PC4 -> province mapping would be misleading.
Matches the SE / SI / GB precedent for sources without clean state
hierarchy.

Source pipeline
---------------
1. Resolve latest release URL via GitHub API
2. Fetch 17 MB 7zip archive
3. Extract via py7zr (pure-Python) -> 401 MB CSV
4. Stream-aggregate 9M+ street rows to (PC4, woonplaats) counts
5. Pick most-common woonplaats per PC4
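
Steps 1 to 3 might look roughly like the sketch below. The releases endpoint and the `assets` / `browser_download_url` fields are part of the GitHub REST API, and `py7zr.SevenZipFile(...).extractall(...)` is the library's extraction call. The helper names, the assumption that the archive asset name ends in `.7z`, and the sample payload are illustrative only.

```python
import json
import urllib.request

LATEST_RELEASE_API = (
    "https://api.github.com/repos/mevdschee/postcodes-nl/releases/latest"
)


def fetch_latest_release() -> dict:
    """Step 1: fetch the latest-release metadata from the GitHub REST API."""
    req = urllib.request.Request(
        LATEST_RELEASE_API, headers={"Accept": "application/vnd.github+json"}
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)


def pick_7z_asset(release: dict) -> str:
    """Return the download URL of the first asset whose name ends in .7z.

    Assumption: the archive is published as a release asset with a .7z suffix.
    """
    for asset in release.get("assets", []):
        if asset["name"].endswith(".7z"):
            return asset["browser_download_url"]
    raise LookupError("no .7z asset found in release")


def extract_archive(archive_path: str, dest_dir: str) -> None:
    """Step 3: unpack the downloaded archive with py7zr (third-party package)."""
    import py7zr  # imported lazily so the rest of the sketch runs without it

    with py7zr.SevenZipFile(archive_path, mode="r") as archive:
        archive.extractall(path=dest_dir)


# Made-up payload in the shape of a GitHub release response, for illustration.
fake_release = {
    "assets": [
        {"name": "notes.txt", "browser_download_url": "https://example.invalid/notes.txt"},
        {"name": "data.7z", "browser_download_url": "https://example.invalid/data.7z"},
    ]
}
print(pick_7z_asset(fake_release))
```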

Regex fix
---------
Before this PR, NL regex was `^\d{4}\s?[a-zA-Z]{2}$` (PC6 only).
Updated to `^\d{4}(?:\s?[A-Za-z]{2})?$` to accept PC4 also,
matching the mixed-granularity pattern already permitted for GB /
TW / CA / IR.

Dependency
----------
Adds runtime dependency on `py7zr` (LGPL, pure-Python 7zip reader).
Documented in importer docstring.

License
-------
mevdschee/postcodes-nl: LGPL-3.0.
Upstream: Dutch Kadaster / BAG (Basisregistratie Adressen en
Gebouwen) public open data.
Each row: source: "bag-via-mevdschee"

Validation
----------
- python3 -m py_compile passes
- 100% regex match against updated NL regex
- Country-only ship (no state_id), follows SE/SI/GB pattern
- No auto-managed fields (id, created_at, updated_at, flag)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@dosubot (bot) added the size:XS label ("This PR changes 0-9 lines, ignoring generated files.") May 4, 2026
@dosubot (bot) added the enhancement label ("New feature or request") May 4, 2026

github-actions Bot commented May 4, 2026

CSC Validation Report

PR Format

  • ✅ Description provided
  • ✅ Data source linked
  • ✅ Issue linked (recommended for data changes)
  • ✅ Justification / context provided

Labels applied: data:countries, data:postcodes, large-contribution

⚠️ Large Contribution

This PR contains 4322 records. Large contributions require manual review.

Schema Validation (4322 records)

Errors (blocking):

  • ❌ contributions/countries/countries.json: Record 1 ("Afghanistan"): "id" must not be included (auto-managed)
  • ❌ contributions/countries/countries.json: Record 1 ("Afghanistan"): "created_at" must not be included (auto-managed)
  • ❌ contributions/countries/countries.json: Record 1 ("Afghanistan"): "updated_at" must not be included (auto-managed)
  • ❌ contributions/countries/countries.json: Record 1 ("Afghanistan"): "flag" must not be included (auto-managed)
  • ❌ contributions/countries/countries.json: Record 2 ("Aland Islands"): "id" must not be included (auto-managed)
  • ❌ contributions/countries/countries.json: Record 2 ("Aland Islands"): "created_at" must not be included (auto-managed)
  • ❌ contributions/countries/countries.json: Record 2 ("Aland Islands"): "updated_at" must not be included (auto-managed)
  • ❌ contributions/countries/countries.json: Record 2 ("Aland Islands"): "flag" must not be included (auto-managed)
  • ❌ contributions/countries/countries.json: Record 3 ("Albania"): "id" must not be included (auto-managed)
  • ❌ contributions/countries/countries.json: Record 3 ("Albania"): "created_at" must not be included (auto-managed)
  • ❌ contributions/countries/countries.json: Record 3 ("Albania"): "updated_at" must not be included (auto-managed)
  • ❌ contributions/countries/countries.json: Record 3 ("Albania"): "flag" must not be included (auto-managed)
  • ❌ contributions/countries/countries.json: Record 4 ("Algeria"): "id" must not be included (auto-managed)
  • ❌ contributions/countries/countries.json: Record 4 ("Algeria"): "created_at" must not be included (auto-managed)
  • ❌ contributions/countries/countries.json: Record 4 ("Algeria"): "updated_at" must not be included (auto-managed)
  • ❌ contributions/countries/countries.json: Record 4 ("Algeria"): "flag" must not be included (auto-managed)
  • ❌ contributions/countries/countries.json: Record 5 ("American Samoa"): "id" must not be included (auto-managed)
  • ❌ contributions/countries/countries.json: Record 5 ("American Samoa"): "created_at" must not be included (auto-managed)
  • ❌ contributions/countries/countries.json: Record 5 ("American Samoa"): "updated_at" must not be included (auto-managed)
  • ❌ contributions/countries/countries.json: Record 5 ("American Samoa"): "flag" must not be included (auto-managed)
  • ...and 980 more errors
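
The blocking errors above all concern the same four auto-managed fields. A minimal sketch of detecting and stripping them from a contribution file before committing is shown below; the helper names and the `name` key used for reporting are assumptions, and this is not the validator's own code.

```python
# Field names taken from the validation errors above.
AUTO_MANAGED = ("id", "created_at", "updated_at", "flag")


def find_auto_managed(records):
    """Return (record number, name, field) for each auto-managed field present."""
    return [
        (number, rec.get("name", "?"), field)
        for number, rec in enumerate(records, start=1)
        for field in AUTO_MANAGED
        if field in rec
    ]


def strip_auto_managed(records):
    """Return copies of the records without any auto-managed fields."""
    return [
        {key: value for key, value in rec.items() if key not in AUTO_MANAGED}
        for rec in records
    ]


# Made-up record for illustration; the timestamp values are placeholders.
records = [
    {"id": 1, "name": "Afghanistan", "created_at": "x", "updated_at": "y", "flag": 1},
]

problems = find_auto_managed(records)
cleaned = strip_auto_managed(records)
print(len(problems), find_auto_managed(cleaned))
```

For the sample record this reports four problems before stripping and none afterwards.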

Warnings:

  • ⚠️ contributions/countries/countries.json: Record 1 ("Afghanistan"): unknown field "population"
  • ⚠️ contributions/countries/countries.json: Record 1 ("Afghanistan"): unknown field "gdp"
  • ⚠️ contributions/countries/countries.json: Record 1 ("Afghanistan"): unknown field "area_sq_km"
  • ⚠️ contributions/countries/countries.json: Record 1 ("Afghanistan"): unknown field "postal_code_format"
  • ⚠️ contributions/countries/countries.json: Record 1 ("Afghanistan"): unknown field "postal_code_regex"
  • ⚠️ contributions/countries/countries.json: Record 2 ("Aland Islands"): unknown field "population"
  • ⚠️ contributions/countries/countries.json: Record 2 ("Aland Islands"): unknown field "gdp"
  • ⚠️ contributions/countries/countries.json: Record 2 ("Aland Islands"): unknown field "area_sq_km"
  • ⚠️ contributions/countries/countries.json: Record 2 ("Aland Islands"): unknown field "postal_code_format"
  • ⚠️ contributions/countries/countries.json: Record 2 ("Aland Islands"): unknown field "postal_code_regex"
  • ...and 1240 more warnings

Cross-Reference Validation

✅ 4072 reference(s) verified

Duplicate Detection

  • ⚠️ contributions/countries/countries.json: Record 1 ("Afghanistan") appears to be a duplicate of existing "Afghanistan" (id: 1, distance: 0.0km)
  • ⚠️ contributions/countries/countries.json: Record 2 ("Aland Islands") appears to be a duplicate of existing "Aland Islands" (id: 2, distance: 0.0km)
  • ⚠️ contributions/countries/countries.json: Record 3 ("Albania") appears to be a duplicate of existing "Albania" (id: 3, distance: 0.0km)
  • ⚠️ contributions/countries/countries.json: Record 4 ("Algeria") appears to be a duplicate of existing "Algeria" (id: 4, distance: 0.0km)
  • ⚠️ contributions/countries/countries.json: Record 5 ("American Samoa") appears to be a duplicate of existing "American Samoa" (id: 5, distance: 0.0km)
  • ⚠️ contributions/countries/countries.json: Record 6 ("Andorra") appears to be a duplicate of existing "Andorra" (id: 6, distance: 0.0km)
  • ⚠️ contributions/countries/countries.json: Record 7 ("Angola") appears to be a duplicate of existing "Angola" (id: 7, distance: 0.0km)
  • ⚠️ contributions/countries/countries.json: Record 8 ("Anguilla") appears to be a duplicate of existing "Anguilla" (id: 8, distance: 0.0km)
  • ⚠️ contributions/countries/countries.json: Record 9 ("Antarctica") appears to be a duplicate of existing "Antarctica" (id: 9, distance: 0.0km)
  • ⚠️ contributions/countries/countries.json: Record 10 ("Antigua and Barbuda") appears to be a duplicate of existing "Antigua and Barbuda" (id: 10, distance: 0.0km)

Source URL Verification

✅ 2 source URL(s) accessible


1000 error(s), 1500 warning(s) | Status: Changes required

Please fix the errors above and push a new commit. Refer to our Contribution Guidelines for details.

@dr5hn dr5hn merged commit 1a9e5a8 into master May 5, 2026
1 check passed
@dr5hn dr5hn deleted the feat/postcodes-netherlands branch May 5, 2026 11:10

Labels

data:countries, data:postcodes, enhancement (New feature or request), large-contribution, needs-changes, size:XS (This PR changes 0-9 lines, ignoring generated files.)
