
Commit 0d778c2

dr5hn and claude authored
feat(postcodes/JP): bulk-import 120,677 codes via Japan Post KEN_ALL (#1039) (#1433)
Largest single-source postcode import yet.

Adds:
1. bin/scripts/sync/import_japan_post_postcodes.py — pipeline reading Japan Post's KEN_ALL.CSV (Shift-JIS encoded). Picks one canonical record per unique 7-digit code (KEN_ALL's natural ordering is geographic by JIS municipality code, so the first hit per zip is a stable primary). Resolves the prefecture by stripping the 都/道/府/県 suffix from the kanji name before matching against states.native.
2. contributions/postcodes/JP.json — 120,677 codes covering all 47 prefectures with 100% state_id resolution.

Format normalisation
- KEN_ALL stores 7-digit zips without a separator (0600000); the regex in countries.json requires the "###-####" form (^\d{3}-\d{4}$). The pipeline inserts the hyphen.

Locality naming
- KEN_ALL splits the address into city (col 7) + town (col 8), both kanji. The pipeline builds a 市+町 ("city + town") composite (e.g. "札幌市中央区旭ケ丘") for normal rows, and falls back to city-only when the town is the catch-all placeholder "以下に掲載がない場合" ("in case not listed below").

State resolution (100%)
- 47 prefectures, all matched
- KEN_ALL: 北海道, 青森県, 東京都, 大阪府, etc.
- states.json native: 北海道, 青森, 東京, 大阪, etc. (suffix often missing)
- Suffix-strip lookup: KEN_ALL '青森県' -> '青森' -> matches state native

Validation (zero errors across 120,677 records)
- All codes match countries.postal_code_regex (^\d{3}-\d{4}$)
- All country_id/state_id foreign keys resolve
- All state_code values agree with state.iso2
- No auto-managed fields present

License & attribution
- Source: Japan Post (free redistribution, including commercial use)
- Each row: source: "japan-post"

File size
- 25.7 MB JSON. Larger than India (3.8 MB) or the US (7.8 MB) but in band with the repo's existing precedent (cities/US.json is 22 MB, cities/IT.json is 11 MB). If size becomes a concern, the existing gz-to-Releases pattern (#1374) extends mechanically to postcodes.

Refs: #1039

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
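The two normalisation rules described above (hyphen insertion and prefecture-suffix stripping) can be sketched in isolation. This is a standalone re-derivation mirroring the importer's helpers, not the importer itself:

```python
import re

# Form required by countries.postal_code_regex for JP.
POSTAL_CODE_RE = re.compile(r"^\d{3}-\d{4}$")
# Tokyo-to, Hokkai-do, Osaka/Kyoto-fu, *-ken
PREFECTURE_SUFFIXES = ("都", "道", "府", "県")


def format_code(zip7: str) -> str:
    """Insert the hyphen KEN_ALL omits: '0600000' -> '060-0000'."""
    return f"{zip7[:3]}-{zip7[3:]}"


def strip_prefecture_suffix(name: str) -> str:
    """'青森県' -> '青森', matching the suffix-less states.native form."""
    return name[:-1] if name.endswith(PREFECTURE_SUFFIXES) else name


print(format_code("0600000"))                               # 060-0000
print(bool(POSTAL_CODE_RE.match(format_code("0600000"))))   # True
print(strip_prefecture_suffix("青森県"))                     # 青森
```

Note the Hokkaido wrinkle: stripping yields '北海', while states.json stores the full '北海道'; the importer resolves this by indexing each state under both its raw and suffix-stripped native names.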
1 parent 13152af commit 0d778c2

2 files changed: 1,206,999 additions & 0 deletions

bin/scripts/sync/import_japan_post_postcodes.py (227 additions & 0 deletions):
#!/usr/bin/env python3
"""Japan Post KEN_ALL -> contributions/postcodes/JP.json importer for issue #1039.

Source data
-----------
Japan Post publishes the canonical 7-digit postcode-to-locality mapping at:

    https://www.post.japanpost.jp/zipcode/dl/kogaki/zip/ken_all.zip

The file (KEN_ALL.CSV) is **Shift-JIS encoded**, ~12 MB raw, ~125,000 rows
covering all 47 prefectures. Each row has 15 columns; the relevant ones are:

    col 0     JIS X 0401/0402 municipality code
    col 1     legacy 5-digit zip (post-war system, kept for compatibility)
    col 2     modern 7-digit zip
    col 3-5   half-width katakana (prefecture, city, town)
    col 6-8   kanji (prefecture, city, town)
    col 9-14  various flags

The dataset publishes one row per (zip, town) pair, so a single zip code
that serves multiple towns appears multiple times. ~120,000 unique 7-digit
codes total.

Licence
-------
Japan Post permits free redistribution of the KEN_ALL data, including for
commercial use, with no formal licence (effectively public domain in
practice). Each generated row records ``source: "japan-post"`` for
provenance.

What this script does
---------------------
1. Reads KEN_ALL.CSV (Shift-JIS) row by row
2. Picks ONE representative locality per unique 7-digit zip — the FIRST
   row encountered (KEN_ALL's natural ordering by JIS code is geographic,
   so this gives a stable "primary" town per zip without scoring)
3. Resolves country_id (JP) and state_id by stripping the prefecture suffix
   (県/府/都/道) from the kanji name and matching against states.native
4. Emits codes in the canonical "###-####" format (the regex requires it)
5. Writes contributions/postcodes/JP.json (~120,000 records)
6. Idempotently merges with existing curated rows by (code, locality_name)

Usage
-----
    python3 -c "import urllib.request; urllib.request.urlretrieve(
        'https://www.post.japanpost.jp/zipcode/dl/kogaki/zip/ken_all.zip',
        '/tmp/ken_all.zip')"
    unzip -o /tmp/ken_all.zip -d /tmp/

    python3 bin/scripts/sync/import_japan_post_postcodes.py \\
        --input /tmp/KEN_ALL.CSV
"""
from __future__ import annotations

import argparse
import csv
import json
import sys
from pathlib import Path
from typing import Dict, List, Optional


PREFECTURE_SUFFIXES = ("都", "道", "府", "県")  # Tokyo-to, Hokkai-do, Osaka/Kyoto-fu, *-ken


def strip_prefecture_suffix(name: str) -> str:
    """Drop the trailing 都/道/府/県 so 北海道 -> 北海, 青森県 -> 青森, etc.

    Hokkaido is the special case — its native form in states.json is
    literally '北海道' (the full name), so we keep that record's native
    intact and match BOTH the suffix-stripped CSV form ('北海') AND the
    full form. A two-pass lookup handles this without hardcoding.
    """
    if name and name.endswith(PREFECTURE_SUFFIXES):
        return name[:-1]
    return name


def build_state_lookup(states_for_jp: List[dict]) -> Dict[str, dict]:
    """Native-name -> state. Keys include both raw and suffix-stripped forms."""
    lookup: Dict[str, dict] = {}
    for s in states_for_jp:
        native = (s.get("native") or "").strip()
        if not native:
            continue
        lookup[native] = s
        stripped = strip_prefecture_suffix(native)
        if stripped:
            lookup[stripped] = s
    return lookup


def resolve_state(csv_kanji: str, lookup: Dict[str, dict]) -> Optional[dict]:
    """Match KEN_ALL's kanji prefecture name to a state record."""
    if not csv_kanji:
        return None
    # Direct match (handles the few states.json entries that already include 県/府)
    if csv_kanji in lookup:
        return lookup[csv_kanji]
    # Suffix-stripped fallback (the common case)
    stripped = strip_prefecture_suffix(csv_kanji)
    if stripped in lookup:
        return lookup[stripped]
    return None
def format_code(zip7: str) -> Optional[str]:
    """Convert KEN_ALL's 7-digit zip to canonical ###-#### form."""
    z = (zip7 or "").strip()
    if len(z) == 7 and z.isdigit():
        return f"{z[:3]}-{z[3:]}"
    return None


def main() -> int:
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument("--input", default="/tmp/KEN_ALL.CSV",
                        help="Path to KEN_ALL.CSV (Shift-JIS encoded)")
    parser.add_argument("--dry-run", action="store_true")
    args = parser.parse_args()

    csv_path = Path(args.input)
    if not csv_path.exists():
        print(f"ERROR: input not found: {csv_path}", file=sys.stderr)
        return 2

    project_root = Path(__file__).resolve().parents[3]
    countries = json.load((project_root / "contributions/countries/countries.json").open(encoding="utf-8"))
    jp = next((c for c in countries if c.get("iso2") == "JP"), None)
    if jp is None:
        print("ERROR: JP not in countries.json", file=sys.stderr)
        return 2
    states = json.load((project_root / "contributions/states/states.json").open(encoding="utf-8"))
    jp_states = [s for s in states if s.get("country_id") == jp["id"]]
    state_lookup = build_state_lookup(jp_states)
    print(f"Country: Japan (id={jp['id']}); states indexed: {len(jp_states)}")

    seen: set = set()
    records: List[dict] = []
    unresolved_prefs: Dict[str, int] = {}
    bad_codes = 0

    with csv_path.open(encoding="shift_jis", errors="replace") as f:
        reader = csv.reader(f)
        for row in reader:
            if len(row) < 9:
                continue
            zip7 = row[2]
            kanji_pref = (row[6] or "").strip()
            kanji_city = (row[7] or "").strip()
            kanji_town = (row[8] or "").strip()
            code = format_code(zip7)
            if not code:
                bad_codes += 1
                continue
            if code in seen:
                continue
            seen.add(code)

            record = {
                "code": code,
                "country_id": int(jp["id"]),
                "country_code": "JP",
            }
            state = resolve_state(kanji_pref, state_lookup)
            if state is not None:
                record["state_id"] = int(state["id"])
                if state.get("iso2"):
                    record["state_code"] = state["iso2"]
            else:
                unresolved_prefs[kanji_pref] = unresolved_prefs.get(kanji_pref, 0) + 1

            # Build a human-readable locality "city + town" using kanji.
            # KEN_ALL uses placeholder text like "以下に掲載がない場合"
            # ("in case not listed below") for catch-all entries; treat
            # those as no town so the locality_name reads cleanly.
            if kanji_town and kanji_town in {"以下に掲載がない場合", "(注)"}:
                locality = kanji_city
            elif kanji_city and kanji_town:
                locality = f"{kanji_city}{kanji_town}"
            else:
                locality = kanji_city or kanji_town
            if locality:
                record["locality_name"] = locality

            record["type"] = "full"
            record["source"] = "japan-post"
            records.append(record)

    records.sort(key=lambda r: (r["code"], r.get("locality_name", "")))

    with_state = sum(1 for r in records if "state_id" in r)
    print(f"Records: {len(records):,}")
    print(f"  with state: {with_state:,} ({with_state*100//max(1, len(records))}%)")
    print(f"  bad codes:  {bad_codes:,}")
    if unresolved_prefs:
        print(f"  unresolved prefectures: {sorted(unresolved_prefs.items(), key=lambda kv: -kv[1])[:5]}")

    if args.dry_run:
        return 0

    target = project_root / "contributions/postcodes/JP.json"
    if target.exists():
        with target.open(encoding="utf-8") as f:
            existing = json.load(f)
        seen_pairs = {(r["code"], (r.get("locality_name") or "").lower()) for r in existing}
        merged = list(existing)
        for r in records:
            key = (r["code"], (r.get("locality_name") or "").lower())
            if key not in seen_pairs:
                merged.append(r)
                seen_pairs.add(key)
        merged.sort(key=lambda r: (r["code"], r.get("locality_name", "")))
    else:
        merged = records

    with target.open("w", encoding="utf-8") as f:
        json.dump(merged, f, ensure_ascii=False, indent=2)
        f.write("\n")
    size_mb = target.stat().st_size / (1024 * 1024)
    print(f"\n[OK] Wrote {target.relative_to(project_root)} ({len(merged):,} rows, {size_mb:.1f} MB)")
    return 0


if __name__ == "__main__":
    raise SystemExit(main())
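The script's two row-handling rules (first row per zip wins; catch-all placeholder falls back to city-only) can be exercised on synthetic KEN_ALL-shaped rows. This is a minimal sketch re-deriving the logic above, not a test of the script itself; the third row's town name is invented for illustration:

```python
# Synthetic rows shaped like KEN_ALL cols 2, 6, 7, 8: (zip, pref, city, town).
rows = [
    ("0600000", "北海道", "札幌市中央区", "以下に掲載がない場合"),  # catch-all placeholder
    ("0640941", "北海道", "札幌市中央区", "旭ケ丘"),
    ("0640941", "北海道", "札幌市中央区", "別の町"),  # duplicate zip: first row wins
]

seen: set = set()
localities: dict = {}
for zip7, _pref, city, town in rows:
    code = f"{zip7[:3]}-{zip7[3:]}"
    if code in seen:
        continue  # KEN_ALL's geographic ordering makes the first row a stable primary
    seen.add(code)
    # Placeholder town means "in case not listed below": keep city only.
    localities[code] = city if town == "以下に掲載がない場合" else city + town

print(localities)
# {'060-0000': '札幌市中央区', '064-0941': '札幌市中央区旭ケ丘'}
```

The duplicate 064-0941 row is skipped entirely, and the catch-all 060-0000 row keeps a clean city-only locality, matching the behaviour the commit message describes.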
