Skip to content

Commit b62bcc5

Browse files
dr5hnclaude
andauthored
feat(postcodes/FR): bulk-import 6,051 metropolitan codes via La Poste (#1039) (#1435)
Adds the importer + first run for metropolitan France. Uses La Poste's official base-officielle-des-codes-postaux dataset from data.gouv.fr (Licence Ouverte v2.0 / etalab-2.0). 1. bin/scripts/sync/import_laposte_postcodes.py — pipeline reading the ISO-8859-1 / semicolon-delimited CSV. Filters to metropolitan France (skips 971-988 overseas + 980 Monaco). Picks one canonical commune per postcode (first alphabetical). Resolves state via postcode-prefix to département iso2 (75=Paris, 13=Bouches-du-Rhône, etc.) with Corsica's special split (200xx-201xx -> 2A, 202xx+ -> 2B) and a 75 -> 75C override (states.json suffixes Paris's iso2). 2. contributions/postcodes/FR.json — 6,051 codes covering all 96 metropolitan départements + Corsica with 100% state_id resolution. Out of scope (deferred) - Overseas territories (GP/MQ/GF/RE/YT/PM/WF/PF/NC/BL/MF) already have curated postcode files from earlier PRs (#1402, #1417-#1426). La Poste's CSV does include their rows (475 skipped); folding the full La Poste data into those territory files is a follow-up scope decision. - Cedex codes — La Poste publishes a separate "Cedex" file with ~10k business-routing codes that don't correspond to geographic places. Those belong in a separate pipeline if added. Validation (zero errors across 6,051 records) - All codes match countries.postal_code_regex (^(\\d{5})\$) - All FKs resolve, all state_codes agree with state.iso2 - No auto-managed fields present Locality names use Libellé d'acheminement (the form La Poste actually prints on mail) rather than raw INSEE commune names — cleaner casing and accents (e.g. "Sainte-Foy-lès-Lyon" rather than "STE FOY LES LYON"). Note: the source CSV is ALL CAPS for Libellé too; if mixed-case is preferred, a follow-up Title Case pass is straightforward. License & attribution - Source: La Poste / data.gouv.fr (Licence Ouverte v2.0, etalab-2.0) - Each row: source: "laposte" Refs: #1039 Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent c0c0f98 commit b62bcc5

2 files changed

Lines changed: 60713 additions & 0 deletions

File tree

Lines changed: 201 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,201 @@
1+
#!/usr/bin/env python3
2+
"""La Poste -> contributions/postcodes/FR.json importer for issue #1039.
3+
4+
Source data
5+
-----------
6+
La Poste publishes the canonical commune-to-postcode mapping under the
7+
**Licence Ouverte v2.0 (etalab-2.0)** at:
8+
9+
https://www.data.gouv.fr/fr/datasets/base-officielle-des-codes-postaux/
10+
11+
The CSV resource (``laposte_hexasmal``) is hosted on Datanova and exported
12+
as semicolon-delimited ISO-8859-1 with these columns:
13+
14+
#Code_commune_INSEE | Nom_de_la_commune | Code_postal |
15+
Libellé_d_acheminement | Ligne_5
16+
17+
~39,000 commune-postcode rows; ~6,300 unique postcodes.
18+
19+
What this script does
20+
---------------------
21+
1. Reads laposte_hexasmal.csv (latin-1 encoded, semicolon-delimited)
22+
2. Filters to metropolitan France only (skips 971-988 overseas prefixes —
23+
those are routed to their own ISO2 territory files in separate PRs)
24+
3. Picks one canonical commune per unique postcode (first alphabetical)
25+
4. Resolves state_id by mapping the postcode's first 2 digits to
26+
state.iso2 (French départements are keyed by 2-digit code: 75=Paris,
27+
13=Bouches-du-Rhône, etc.) with Corsica's special split:
28+
- 200xx-201xx -> 2A (Corse-du-Sud)
29+
- 202xx-209xx -> 2B (Haute-Corse)
30+
5. Writes contributions/postcodes/FR.json
31+
6. Idempotent merge with existing curated rows by (code, locality_name)
32+
33+
Why metropolitan only
34+
---------------------
35+
French overseas territories (Guadeloupe 971, Martinique 972, French Guiana
36+
973, Réunion 974, Saint-Pierre-et-Miquelon 975, Mayotte 976, Wallis-et-
37+
Futuna 986, French Polynesia 987, New Caledonia 988, Saint-Barthélemy
38+
97133, Saint-Martin 97150) are SEPARATE countries in this dataset's
39+
ISO2 schema, with their own contributions/postcodes/{ISO2}.json files
40+
already curated (#1402 through #1426). Routing La Poste's overseas rows
41+
to those files is a follow-up scope decision.
42+
43+
License & attribution
44+
---------------------
45+
- Source: La Poste / data.gouv.fr (Licence Ouverte v2.0 / etalab-2.0)
46+
- Each row records source: "laposte"
47+
48+
Usage
49+
-----
50+
python3 -c "import urllib.request; urllib.request.urlretrieve(
51+
'https://datanova.laposte.fr/data-fair/api/v1/datasets/laposte-hexasmal/raw',
52+
'/tmp/laposte_hexasmal.csv')"
53+
54+
python3 bin/scripts/sync/import_laposte_postcodes.py
55+
"""
56+
57+
from __future__ import annotations
58+
59+
import argparse
60+
import csv
61+
import json
62+
import sys
63+
from pathlib import Path
64+
from typing import Dict, List, Optional
65+
66+
67+
# Postcodes belonging to overseas territories or to Monaco: skip in FR import.
68+
# Overseas land in their own ISO2 contribution files (BL/MF/GP/MQ/GF/RE/YT/PM/
69+
# WF/PF/NC, already curated). 980 = Monaco (separate country, MC.json shipped
70+
# in #1402).
71+
SKIP_PREFIXES = ("971", "972", "973", "974", "975", "976", "980", "986", "987", "988")
72+
73+
# states.json uses suffixed iso2 codes for Paris (75C) and Lyon Metropolis (69M)
74+
# instead of the bare 75 / 69. Map the postcode-prefix -> state-iso2 before
75+
# falling back to the simple two-digit lookup.
76+
PREFIX_OVERRIDES: Dict[str, str] = {
77+
"75": "75C", # Paris
78+
}
79+
80+
81+
def resolve_state(code: str, state_by_iso2: Dict[str, dict]) -> Optional[dict]:
82+
"""Map a metropolitan French postcode to a département state record."""
83+
if not code or len(code) < 2:
84+
return None
85+
# Corsica: 200xx-201xx -> 2A (Corse-du-Sud), 202xx-209xx -> 2B (Haute-Corse)
86+
if code.startswith("20"):
87+
if len(code) >= 3 and code[2] in ("0", "1"):
88+
return state_by_iso2.get("2A")
89+
return state_by_iso2.get("2B")
90+
prefix = code[:2]
91+
override = PREFIX_OVERRIDES.get(prefix)
92+
if override and override in state_by_iso2:
93+
return state_by_iso2[override]
94+
return state_by_iso2.get(prefix)
95+
96+
97+
def main() -> int:
98+
parser = argparse.ArgumentParser(description=__doc__)
99+
parser.add_argument("--input", default="/tmp/laposte_hexasmal.csv")
100+
parser.add_argument("--dry-run", action="store_true")
101+
args = parser.parse_args()
102+
103+
csv_path = Path(args.input)
104+
if not csv_path.exists():
105+
print(f"ERROR: input not found: {csv_path}", file=sys.stderr)
106+
return 2
107+
108+
project_root = Path(__file__).resolve().parents[3]
109+
countries = json.load((project_root / "contributions/countries/countries.json").open(encoding="utf-8"))
110+
fr = next((c for c in countries if c.get("iso2") == "FR"), None)
111+
if fr is None:
112+
print("ERROR: FR not in countries.json", file=sys.stderr)
113+
return 2
114+
states = json.load((project_root / "contributions/states/states.json").open(encoding="utf-8"))
115+
fr_states = [s for s in states if s.get("country_id") == fr["id"]]
116+
state_by_iso2: Dict[str, dict] = {(s.get("iso2") or "").upper(): s for s in fr_states if s.get("iso2")}
117+
print(f"Country: France (id={fr['id']}); states indexed by iso2: {len(state_by_iso2)}")
118+
119+
# Group rows by postcode; pick first alphabetical commune as canonical
120+
by_postcode: Dict[str, List[dict]] = {}
121+
skipped_overseas = 0
122+
skipped_bad = 0
123+
with csv_path.open(encoding="latin-1") as f:
124+
reader = csv.DictReader(f, delimiter=";")
125+
for row in reader:
126+
code = (row.get("Code_postal") or "").strip()
127+
commune = (row.get("Nom_de_la_commune") or "").strip()
128+
if not code or not code.isdigit() or len(code) != 5:
129+
skipped_bad += 1
130+
continue
131+
if any(code.startswith(p) for p in SKIP_PREFIXES):
132+
skipped_overseas += 1
133+
continue
134+
by_postcode.setdefault(code, []).append({
135+
"commune": commune,
136+
"insee": (row.get("#Code_commune_INSEE") or "").strip(),
137+
"libelle": (row.get("Libellé_d_acheminement") or "").strip(),
138+
})
139+
140+
print(f"Skipped overseas rows: {skipped_overseas:,}")
141+
print(f"Skipped malformed: {skipped_bad:,}")
142+
print(f"Unique metro postcodes: {len(by_postcode):,}")
143+
144+
records: List[dict] = []
145+
matched_state = 0
146+
for code in sorted(by_postcode):
147+
rows = sorted(by_postcode[code], key=lambda r: r["commune"].upper())
148+
chosen = rows[0]
149+
record = {
150+
"code": code,
151+
"country_id": int(fr["id"]),
152+
"country_code": "FR",
153+
}
154+
state = resolve_state(code, state_by_iso2)
155+
if state is not None:
156+
record["state_id"] = int(state["id"])
157+
record["state_code"] = state.get("iso2") or ""
158+
matched_state += 1
159+
# Prefer the cleaner Libellé d'acheminement (mailing label) for locality
160+
# over the raw commune name. INSEE commune names are often all-caps and
161+
# spell-stripped (e.g. "L ABERGEMENT CLEMENCIAT" vs the cleaner
162+
# "L'Abergement-Clémenciat" elsewhere). The acheminement label is the
163+
# version La Poste actually uses on mail.
164+
locality = chosen["libelle"] or chosen["commune"]
165+
if locality:
166+
record["locality_name"] = locality
167+
record["type"] = "full"
168+
record["source"] = "laposte"
169+
records.append(record)
170+
171+
print(f"Records: {len(records):,}")
172+
print(f" with state: {matched_state:,} ({matched_state*100//max(1,len(records))}%)")
173+
174+
if args.dry_run:
175+
return 0
176+
177+
target = project_root / "contributions/postcodes/FR.json"
178+
if target.exists():
179+
with target.open(encoding="utf-8") as f:
180+
existing = json.load(f)
181+
seen = {(r["code"], (r.get("locality_name") or "").lower()) for r in existing}
182+
merged = list(existing)
183+
for r in records:
184+
key = (r["code"], (r.get("locality_name") or "").lower())
185+
if key not in seen:
186+
merged.append(r)
187+
seen.add(key)
188+
merged.sort(key=lambda r: (r["code"], r.get("locality_name", "")))
189+
else:
190+
merged = sorted(records, key=lambda r: (r["code"], r.get("locality_name", "")))
191+
192+
with target.open("w", encoding="utf-8") as f:
193+
json.dump(merged, f, ensure_ascii=False, indent=2)
194+
f.write("\n")
195+
size_mb = target.stat().st_size / (1024 * 1024)
196+
print(f"\n[OK] Wrote {target.relative_to(project_root)} ({len(merged):,} rows, {size_mb:.1f} MB)")
197+
return 0
198+
199+
200+
if __name__ == "__main__":
201+
raise SystemExit(main())

0 commit comments

Comments
 (0)