
Commit fc995df

dr5hn and claude committed
feat(postcodes/NL): import 4,072 Dutch PC4 postcodes (#1039)
Adds the Netherlands' 4-digit postcode districts (PC4) aggregated from the mevdschee/postcodes-nl LGPL-3 mirror of Dutch BAG / Kadaster public address data.

Why
---
Closes the NL gap on issue #1039. The full PC6 (4-digit + 2-letter) list has 467,109 unique codes, which would generate ~70 MB of JSON and exceed the in-band cities/*.json size envelope (PT.json at 38 MB is the current largest). PC4 is the standard Dutch district-level granularity (~4,000 districts), comparable to UK postcode areas and Canada FSAs already shipped at this scale.

Coverage
--------
- 4,072 PC4 records / country-only state FK
- Each row carries the most-common woonplaats (settlement) per PC4 as the representative locality_name

State FK strategy
-----------------
Country-only ship. The Netherlands' 12 provinces span PC4 ranges with significant overlap (e.g. PC4 1xxx covers Noord-Holland and Flevoland), so a 1:1 PC4 -> province mapping would be misleading. Matches the SE / SI / GB precedent for sources without a clean state hierarchy.

Source pipeline
---------------
1. Resolve latest release URL via GitHub API
2. Fetch 17 MB 7zip archive
3. Extract via py7zr (pure-Python) -> 401 MB CSV
4. Stream-aggregate 9M+ street rows to (PC4, woonplaats) counts
5. Pick most-common woonplaats per PC4

Regex fix
---------
Before this PR, the NL regex was `^(\d{4}\s?[a-zA-Z]{2})$` (PC6 only). Updated to `^(\d{4}(?:\s?[A-Za-z]{2})?)$` to accept PC4 as well, matching the mixed-granularity pattern already permitted for GB / TW / CA / IR.

Dependency
----------
Adds a runtime dependency on `py7zr` (LGPL, pure-Python 7zip reader). Documented in the importer docstring.

License
-------
mevdschee/postcodes-nl: LGPL-3.0. Upstream: Dutch Kadaster / BAG (Basisregistratie Adressen en Gebouwen) public open data.
Each row: source: "bag-via-mevdschee"

Validation
----------
- python3 -m py_compile passes
- 100% regex match against the updated NL regex
- Country-only ship (no state_id), follows the SE/SI/GB pattern
- No auto-managed fields (id, created_at, updated_at, flag)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
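Steps 4 and 5 of the source pipeline can be sketched on a tiny in-memory CSV. This is an illustration rather than the importer itself: the sample rows and the `aggregate_pc4` helper are hypothetical, and the column names `postcode` and `woonplaats` are taken from the importer added in this commit.

```python
import collections
import csv
import io

# Hypothetical sample rows; the real CSV holds 9M+ street-level rows.
SAMPLE_CSV = """postcode,woonplaats
1011AB,Amsterdam
1011 ac,Amsterdam
1011AD,Diemen
3511AA,Utrecht
BAD,Nowhere
"""


def aggregate_pc4(lines):
    """Count woonplaats occurrences per PC4 and keep the most common one."""
    counts = collections.defaultdict(collections.Counter)
    for row in csv.DictReader(lines):
        # Normalise "1011 ac" -> "1011AC" before taking the 4-digit prefix.
        pc = (row.get("postcode") or "").replace(" ", "").upper()
        wp = (row.get("woonplaats") or "").strip()
        if len(pc) >= 4 and pc[:4].isdigit():
            counts[pc[:4]][wp] += 1
    return {pc4: c.most_common(1)[0][0] for pc4, c in sorted(counts.items())}


result = aggregate_pc4(io.StringIO(SAMPLE_CSV))
print(result)
```

Rows whose postcode does not start with four digits are dropped, so the malformed `BAD` row never reaches the output.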
1 parent 085bfd5 commit fc995df

3 files changed

Lines changed: 32793 additions & 3 deletions

bin/scripts/sync/import_netherlands_postcodes.py

Lines changed: 212 additions & 0 deletions
@@ -0,0 +1,212 @@
#!/usr/bin/env python3
"""Netherlands -> contributions/postcodes/NL.json importer for issue #1039.

Source data
-----------
The community ``mevdschee/postcodes-nl`` GitHub release ships a
17 MB 7zip archive containing a 401 MB CSV with all Dutch
street-level address data joining each postcode (PC6: 4-digit +
2-letter) to a woonplaats (settlement).

Source URL: https://github.com/mevdschee/postcodes-nl/releases/latest

What this script does
---------------------
1. Resolves the latest release via the GitHub API.
2. Fetches the 17 MB 7zip into memory and extracts the CSV to a
   temporary directory via py7zr.
3. Aggregates 9M+ street-level rows into 4,072 unique PC4
   (4-digit) districts, picking the most-common woonplaats per PC4
   as the representative locality.
4. Writes contributions/postcodes/NL.json idempotently.

Why PC4 (not PC6)
-----------------
Unique PC6 codes total 467,109. A JSON export at PC6 level would
exceed the in-band cities/*.json size envelope (PT.json at 38 MB
is the current largest; PC6 expansion would be ~70 MB).

PC4 (4-digit) is the standard Dutch district-level postcode
granularity (~4,000 districts), comparable to UK postcode areas
and Canada FSAs already shipped at this scale.

Why country-only state FK
-------------------------
The Netherlands' 12 provinces span PC4 ranges with significant
overlap (e.g. PC4 1xxx covers parts of Noord-Holland and Flevoland).
A 1:1 PC4 -> province map would be misleading. CSC matches the
SE / SI / GB precedent for sources that don't map cleanly to a
hierarchy.

Regex fix
---------
Before this PR, countries.json had NL regex `^(\\d{4}\\s?[a-zA-Z]{2})$`
(PC6 only). Updated to `^(\\d{4}(?:\\s?[A-Za-z]{2})?)$` to accept
PC4 as well, matching the mixed-granularity pattern already permitted
for GB / TW / CA / IR.

Dependency
----------
- ``py7zr`` (LGPL, pure-Python 7zip reader). Install via:
    python3 -m pip install py7zr

License & attribution
---------------------
- Source: mevdschee/postcodes-nl (LGPL-3 per repo LICENSE)
- Upstream: Dutch Kadaster / BAG (Basisregistratie Adressen en
  Gebouwen), public open-data lookup
- Each row: ``source: "bag-via-mevdschee"``

Usage
-----
    python3 bin/scripts/sync/import_netherlands_postcodes.py
"""
64+
from __future__ import annotations
65+
66+
import argparse
67+
import collections
68+
import csv
69+
import io
70+
import json
71+
import os
72+
import re
73+
import sys
74+
import tempfile
75+
import urllib.request
76+
from pathlib import Path
77+
from typing import Dict, List
78+
79+
import py7zr
80+
81+
82+
RELEASES_API = "https://api.github.com/repos/mevdschee/postcodes-nl/releases/latest"


def fetch_bytes(url: str) -> bytes:
    """Download ``url`` and return the whole response body as bytes."""
    req = urllib.request.Request(
        url, headers={"User-Agent": "csc-database-postcode-importer"}
    )
    with urllib.request.urlopen(req, timeout=600) as r:
        return r.read()


def resolve_archive_url() -> str:
    """Return the download URL of the first .7z asset in the latest release."""
    req = urllib.request.Request(
        RELEASES_API, headers={"User-Agent": "csc-database-postcode-importer"}
    )
    with urllib.request.urlopen(req, timeout=30) as r:
        meta = json.loads(r.read())
    for asset in meta.get("assets", []):
        if asset.get("name", "").endswith(".7z"):
            return asset["browser_download_url"]
    raise RuntimeError("No .7z asset in latest release")


def main() -> int:
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument("--input", default=None, help="local 7z (skip fetch)")
    parser.add_argument("--dry-run", action="store_true")
    args = parser.parse_args()

    if args.input:
        raw = Path(args.input).read_bytes()
    else:
        url = resolve_archive_url()
        print(f"fetching {url}")
        raw = fetch_bytes(url)
    print(f"7z size: {len(raw):,} bytes")

    project_root = Path(__file__).resolve().parents[3]
    countries_path = project_root / "contributions/countries/countries.json"
    with countries_path.open(encoding="utf-8") as f:
        countries = json.load(f)
    nl_country = next((c for c in countries if c.get("iso2") == "NL"), None)
    if nl_country is None:
        print("ERROR: NL not in countries.json", file=sys.stderr)
        return 2
    regex = re.compile(nl_country.get("postal_code_regex") or ".*")
    print(f"Country: Netherlands (id={nl_country['id']})")

    # Aggregate to PC4 with most-common woonplaats per PC4
    pc4_woonplaatsen: Dict[str, collections.Counter] = collections.defaultdict(
        collections.Counter
    )

    with tempfile.TemporaryDirectory() as tmp:
        with py7zr.SevenZipFile(io.BytesIO(raw), mode="r") as archive:
            archive.extractall(path=tmp)
        csv_path = next(
            (os.path.join(tmp, n) for n in os.listdir(tmp) if n.endswith(".csv")),
            None,
        )
        if csv_path is None:
            print("ERROR: no CSV found in archive", file=sys.stderr)
            return 2
        print(f"CSV: {os.path.getsize(csv_path):,} bytes")
        # newline="" lets the csv module handle embedded line breaks itself.
        with open(csv_path, encoding="utf-8", newline="") as f:
            reader = csv.DictReader(f)
            for i, row in enumerate(reader):
                if i and i % 2_000_000 == 0:
                    print(f"  processed {i:,}")
                pc = (row.get("postcode") or "").replace(" ", "").upper()
                wp = (row.get("woonplaats") or "").strip()
                if len(pc) >= 4 and pc[:4].isdigit():
                    pc4_woonplaatsen[pc[:4]][wp] += 1

    print(f"unique PC4: {len(pc4_woonplaatsen):,}")

    records: List[dict] = []
    skipped_bad_regex = 0

    for pc4 in sorted(pc4_woonplaatsen):
        if not regex.match(pc4):
            skipped_bad_regex += 1
            continue
        wp = pc4_woonplaatsen[pc4].most_common(1)[0][0]

        record: Dict[str, object] = {
            "code": pc4,
            "country_id": int(nl_country["id"]),
            "country_code": "NL",
        }
        if wp:
            record["locality_name"] = wp
        record["type"] = "area"
        record["source"] = "bag-via-mevdschee"
        records.append(record)

    print(f"Skipped (regex fail): {skipped_bad_regex:,}")
    print(f"Records emitted: {len(records):,}")

    if args.dry_run:
        return 0

    def sort_key(r: dict) -> tuple:
        # "or" also covers an explicit null locality_name in existing rows.
        return (r["code"], r.get("locality_name") or "")

    target = project_root / "contributions/postcodes/NL.json"
    target.parent.mkdir(parents=True, exist_ok=True)
    if target.exists():
        with target.open(encoding="utf-8") as f:
            existing = json.load(f)
        existing_seen = {
            (r["code"], (r.get("locality_name") or "").lower()) for r in existing
        }
        merged = list(existing)
        for r in records:
            key = (r["code"], (r.get("locality_name") or "").lower())
            if key not in existing_seen:
                merged.append(r)
                existing_seen.add(key)
        merged.sort(key=sort_key)
    else:
        merged = sorted(records, key=sort_key)

    with target.open("w", encoding="utf-8") as f:
        json.dump(merged, f, ensure_ascii=False, indent=2)
        f.write("\n")
    size_kb = target.stat().st_size / 1024
    print(
        f"\n[OK] Wrote {target.relative_to(project_root)} "
        f"({len(merged):,} rows, {size_kb:.0f} KB)"
    )
    return 0


if __name__ == "__main__":
    raise SystemExit(main())
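
The idempotent write at the end of `main()` relies on a merge keyed by `(code, lowercased locality_name)`. A minimal sketch of that merge, pulled out into a hypothetical `merge_records` helper with made-up sample rows, shows that re-running the import with the same input leaves the output unchanged:

```python
def merge_records(existing, new):
    """Append rows from `new` whose (code, locality) key is not yet present."""
    seen = {(r["code"], (r.get("locality_name") or "").lower()) for r in existing}
    merged = list(existing)
    for r in new:
        key = (r["code"], (r.get("locality_name") or "").lower())
        if key not in seen:
            merged.append(r)
            seen.add(key)
    merged.sort(key=lambda r: (r["code"], r.get("locality_name") or ""))
    return merged


# Hypothetical rows: the first new row differs from the existing one only by case.
existing = [{"code": "1011", "locality_name": "Amsterdam"}]
new = [
    {"code": "1011", "locality_name": "AMSTERDAM"},
    {"code": "1012", "locality_name": "Amsterdam"},
]

first = merge_records(existing, new)
second = merge_records(first, new)  # second run adds nothing
print(len(first), first == second)
```

One consequence of this key choice is that if a later release changes the most-common woonplaats for a PC4, the old row is kept and a second row for the same code is appended rather than the existing one being replaced.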

contributions/countries/countries.json

Lines changed: 3 additions & 3 deletions
@@ -10434,8 +10434,8 @@
  "subregion_id": 17,
  "nationality": "Dutch, Netherlandic",
  "area_sq_km": 41526.0,
- "postal_code_format": "#### @@",
- "postal_code_regex": "^(\\d{4}\\s?[a-zA-Z]{2})$",
+ "postal_code_format": "####|#### @@",
+ "postal_code_regex": "^(\\d{4}(?:\\s?[A-Za-z]{2})?)$",
  "timezones": [
  {
  "zoneName": "Europe/Amsterdam",
@@ -16747,4 +16747,4 @@
  "flag": 1,
  "wikiDataId": "Q26273"
  }
- ]
+ ]
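
The effect of the `postal_code_regex` change can be checked with a short comparison. The two patterns below are copied from the diff (with JSON escaping removed); the sample inputs are illustrative only.

```python
import re

# Patterns from countries.json before and after this commit.
OLD = re.compile(r"^(\d{4}\s?[a-zA-Z]{2})$")
NEW = re.compile(r"^(\d{4}(?:\s?[A-Za-z]{2})?)$")

samples = ["1011", "1011AB", "1011 ab", "101", "1011A"]

# Map each sample to (matched by old regex, matched by new regex).
results = {s: (bool(OLD.match(s)), bool(NEW.match(s))) for s in samples}
for s, (old_ok, new_ok) in results.items():
    print(f"{s!r:10} old={old_ok} new={new_ok}")
```

Only the bare PC4 value changes status: the new pattern accepts it while still rejecting a three-digit code or a single trailing letter.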
