feat(postcodes/JP): bulk-import 120,677 codes via Japan Post KEN_ALL (#1039) #1433
Merged
…1039)

Largest single-source postcode import yet. Adds:

1. bin/scripts/sync/import_japan_post_postcodes.py: pipeline reading Japan Post's KEN_ALL.CSV (Shift-JIS encoded). Picks one canonical record per unique 7-digit code (KEN_ALL's natural ordering is geographic by JIS municipality code, so the first hit per zip is a stable primary). Resolves the prefecture by stripping the 都/道/府/県 suffix from the kanji name before matching against states.native.
2. contributions/postcodes/JP.json: 120,677 codes covering all 47 prefectures with 100% state_id resolution.

Format normalisation
- KEN_ALL stores 7-digit zips without a separator (0600000); the regex in countries.json requires the "###-####" form (^\d{3}-\d{4}$). The pipeline inserts the hyphen.

Locality naming
- KEN_ALL splits the address into city (col 7) + town (col 8), both kanji (0-indexed columns)
- The pipeline builds a "city + town" composite (e.g. "札幌市中央区旭ケ丘") for normal rows, and falls back to city-only when town is the catch-all placeholder "以下に掲載がない場合" ("when not listed below")

State resolution (100%)
- 47 prefectures, all matched
- KEN_ALL: 北海道, 青森県, 東京都, 大阪府, etc.
- states.json native: 北海道, 青森, 東京, 大阪, etc. (suffix often missing)
- Suffix-strip lookup: KEN_ALL '青森県' -> '青森' -> matches state native

Validation (zero errors across 120,677 records)
- All codes match countries.postal_code_regex (^\d{3}-\d{4}$)
- All country_id/state_id foreign keys resolve
- All state_code values agree with state.iso2
- No auto-managed fields present

License & attribution
- Source: Japan Post (free redistribution, including commercial)
- Each row: source: "japan-post"

File size
- 25.7 MB JSON. Larger than the India (3.8 MB) or US (7.8 MB) postcode files but in-band with the repo's existing precedent (cities/US.json is 22 MB, cities/IT.json is 11 MB). If size becomes a concern, the existing gz-to-Releases pattern (#1374) extends mechanically to postcodes.

Refs: #1039
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
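The row-level rules in the commit message (hyphen insertion, city/town composite, suffix-strip prefecture lookup, first hit per code) can be sketched roughly as follows. This is a minimal illustration, not the script from this PR: the function names, the output keys (`code`, `locality`, `state_id`, `source`), the shape of `states_by_native`, and the zip and prefecture column indices are assumptions. Only the city and town columns (7 and 8) are stated in the commit message.

```python
import csv

# Catch-all town value quoted in the commit message.
PLACEHOLDER_TOWN = "以下に掲載がない場合"
PREF_SUFFIXES = ("都", "道", "府", "県")

# Assumed 0-indexed KEN_ALL columns; only city (7) and town (8) come from the commit message.
COL_ZIP, COL_PREF, COL_CITY, COL_TOWN = 2, 6, 7, 8


def read_ken_all(path):
    """Yield raw CSV rows, decoding with the options named in the PR."""
    with open(path, encoding="shift_jis", errors="replace", newline="") as fh:
        yield from csv.reader(fh)


def format_zip(raw):
    """Turn a bare 7-digit code such as '0600000' into '060-0000'."""
    raw = raw.strip()
    if len(raw) != 7 or not raw.isdigit():
        raise ValueError(f"not a 7-digit code: {raw!r}")
    return f"{raw[:3]}-{raw[3:]}"


def build_locality(city, town):
    """City + town composite, or city alone for the placeholder town."""
    return city if town == PLACEHOLDER_TOWN else city + town


def resolve_state(pref, states_by_native):
    """Two-pass lookup: exact native name first, then with the suffix stripped."""
    if pref in states_by_native:
        return states_by_native[pref]
    if pref.endswith(PREF_SUFFIXES):
        return states_by_native.get(pref[:-1])
    return None


def import_rows(rows, states_by_native):
    """Keep the first row seen for each code (hypothetical output shape)."""
    records = {}
    for row in rows:
        code = format_zip(row[COL_ZIP])
        if code in records:
            continue  # a later row for the same code is ignored
        records[code] = {
            "code": code,
            "locality": build_locality(row[COL_CITY], row[COL_TOWN]),
            "state_id": resolve_state(row[COL_PREF], states_by_native),
            "source": "japan-post",
        }
    return list(records.values())
```

Under these assumptions, `import_rows(read_ken_all("KEN_ALL.CSV"), states_by_native)` would produce one record per unique code, with `state_id` left as `None` for any prefecture name that neither pass can match.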
Summary
Largest single-source postcode import yet. Adds:
- bin/scripts/sync/import_japan_post_postcodes.py: pipeline reading Japan Post's KEN_ALL.CSV (Shift-JIS encoded, ~125k rows, ~12 MB raw). Free-redistribution source.
- contributions/postcodes/JP.json: 120,677 codes covering all 47 prefectures with 100% state_id resolution.

How it works
- Read with encoding="shift_jis", errors="replace"
- Insert the hyphen: 0600000 → 060-0000 (regex ^\d{3}-\d{4}$ requires it)
- Strip the 都/道/府/県 suffix from the KEN_ALL kanji (e.g. 青森県 → 青森) and match against states.native. A two-pass lookup also handles entries that already include the suffix in states.json (北海道, 宮城県, etc.)
- Build the locality as city + town (e.g. 札幌市中央区旭ケ丘); fall back to city-only when town is the catch-all placeholder 以下に掲載がない場合 ("when not listed below")

Validation (zero errors across 120,677 records)
- All state_id values resolved
- All codes match postal_code_regex (^\d{3}-\d{4}$)
- state_code ↔ state.iso2 agreement

License & attribution
- Each row carries source: "japan-post" for provenance

File size
25.7 MB JSON. Larger than the India (3.8 MB) or US (7.8 MB) postcode files but in-band with existing precedent:
- contributions/cities/US.json is 22 MB
- contributions/cities/IT.json is 11 MB
If size becomes a concern, the existing .gz-to-Releases pattern (#1374) extends mechanically to postcodes.

Cumulative postcode coverage after this lands
Refs: #1039
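
For reviewers who want to spot-check the output, the validation checks listed above could be approximated with a short script along these lines. It is only a sketch under assumptions: the record keys (`code`, `state_id`, `state_code`), the `states_by_id` mapping, and the set of auto-managed field names are hypothetical and are not taken from the repository's actual validator.

```python
import re

# Regex quoted in the PR description.
POSTAL_CODE_RE = re.compile(r"^\d{3}-\d{4}$")

# Assumed names for auto-managed fields; the real list is not given in the PR.
AUTO_MANAGED_FIELDS = {"id", "created_at", "updated_at"}


def validate_records(records, states_by_id):
    """Return a list of (index, message) tuples; an empty list means no errors."""
    errors = []
    for i, rec in enumerate(records):
        code = rec.get("code", "")
        # fullmatch avoids accepting a trailing newline, which "$" alone would allow.
        if not POSTAL_CODE_RE.fullmatch(code):
            errors.append((i, f"code {code!r} does not match ###-####"))

        state = states_by_id.get(rec.get("state_id"))
        if state is None:
            errors.append((i, "state_id does not resolve"))
        elif rec.get("state_code") != state["iso2"]:
            errors.append((i, "state_code disagrees with state.iso2"))

        extra = AUTO_MANAGED_FIELDS.intersection(rec)
        if extra:
            errors.append((i, f"auto-managed fields present: {sorted(extra)}"))
    return errors
```

A clean import corresponds to `validate_records(records, states_by_id)` returning an empty list.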