feat(postcodes/JP): bulk-import 120,677 codes via Japan Post KEN_ALL (#1039) #1433

Merged
dr5hn merged 1 commit into master from feat/postcodes-japan-bulk
Apr 27, 2026

dr5hn (Owner) commented Apr 27, 2026

Summary

Largest single-source postcode import yet. Adds:

  1. bin/scripts/sync/import_japan_post_postcodes.py — pipeline reading Japan Post's KEN_ALL.CSV (Shift-JIS encoded, ~125k rows, ~12 MB raw). Free-redistribution source.

  2. contributions/postcodes/JP.json: 120,677 codes covering all 47 prefectures with 100% state_id resolution.
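
As a rough illustration of the reading step, here is a minimal sketch of decoding Shift-JIS bytes and pulling out the fields the pipeline needs. It is not the code in import_japan_post_postcodes.py. The names `decode_ken_all` and `read_ken_all` are hypothetical, the zip and prefecture column indexes are assumptions (only city and town positions are stated in this PR), and the two sample rows are illustrative with the kana columns replaced by ASCII placeholders.

```python
import csv
import io

# Assumed 0-based KEN_ALL column positions; only city (7) and town (8)
# are stated in this PR, the zip (2) and prefecture (6) indexes are assumptions.
COL_ZIP, COL_PREF, COL_CITY, COL_TOWN = 2, 6, 7, 8


def decode_ken_all(raw_bytes):
    """Wrap raw Shift-JIS bytes as a text stream suitable for csv.reader."""
    return io.TextIOWrapper(
        io.BytesIO(raw_bytes),
        encoding="shift_jis",
        errors="replace",  # undecodable bytes become U+FFFD instead of raising
        newline="",        # let the csv module handle line endings itself
    )


def read_ken_all(stream):
    """Yield (zip7, prefecture, city, town) for each well-formed row."""
    for row in csv.reader(stream):
        if len(row) <= COL_TOWN:
            continue  # skip short or malformed rows
        yield row[COL_ZIP], row[COL_PREF], row[COL_CITY], row[COL_TOWN]


# Two illustrative rows (kana columns replaced with ASCII placeholders).
SAMPLE = (
    '01101,"060  ","0600000","KANA","KANA","KANA",'
    '"北海道","札幌市中央区","以下に掲載がない場合",0,0,0,0,0,0\r\n'
    '01101,"064  ","0640941","KANA","KANA","KANA",'
    '"北海道","札幌市中央区","旭ケ丘",0,0,1,0,0,0\r\n'
).encode("shift_jis")

rows = list(read_ken_all(decode_ken_all(SAMPLE)))
for row in rows:
    print(row)
```

With errors="replace", a corrupt byte shows up as a replacement character in the output rather than aborting the import, so it is worth scanning for U+FFFD afterwards.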

How it works

| Step | Detail |
| --- | --- |
| Encoding | KEN_ALL is Shift-JIS; the pipeline opens it with encoding="shift_jis", errors="replace" |
| Format normalisation | KEN_ALL stores zips without a separator, so 0600000 becomes 060-0000 (the regex ^\d{3}-\d{4}$ requires the hyphen) |
| Dedupe | KEN_ALL has one row per (zip, town); the pipeline keeps the first occurrence per zip (geographic JIS-code ordering gives a stable primary) |
| Prefecture resolution | Strip the 都/道/府/県 suffix from the KEN_ALL kanji name (e.g. 青森県 → 青森) and match against states.native. A two-pass lookup also handles entries that already include the suffix in states.json (北海道, 宮城県, etc.) |
| Locality name | Build kanji city + town (e.g. 札幌市中央区旭ケ丘); fall back to city-only when town is the catch-all placeholder 以下に掲載がない場合 ("if not listed below") |
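
The normalisation and dedupe rows can be sketched roughly as follows. This is a minimal illustration under assumptions, not the pipeline's actual code: `normalise_zip` and `first_per_zip` are hypothetical names, and the input rows are placeholders rather than real KEN_ALL data.

```python
import re

ZIP7_RE = re.compile(r"[0-9]{7}")


def normalise_zip(zip7):
    """Turn a bare 7-digit code such as '0600000' into '060-0000'."""
    zip7 = zip7.strip()
    if not ZIP7_RE.fullmatch(zip7):
        raise ValueError(f"not a 7-digit postcode: {zip7!r}")
    return f"{zip7[:3]}-{zip7[3:]}"


def first_per_zip(rows):
    """Keep the first (zip, city, town) row seen for each postcode."""
    primary = {}
    for zip7, city, town in rows:
        code = normalise_zip(zip7)
        # setdefault leaves an existing entry alone, so later rows are ignored
        primary.setdefault(code, (code, city, town))
    return list(primary.values())


# Placeholder rows: the second and third share a postcode.
rows = [
    ("0600000", "city-a", "town-1"),
    ("0600001", "city-a", "town-2"),
    ("0600001", "city-a", "town-3"),
]
deduped = first_per_zip(rows)
print(deduped)
```

Because a dict preserves insertion order, the output keeps the source file's ordering, which is what makes "first occurrence" a stable choice as long as KEN_ALL's own ordering is stable.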

Validation (zero errors across 120,677 records)

| Check | Result |
| --- | --- |
| Records | 120,677 |
| state_id resolved | 100% (47/47 prefectures matched) |
| Codes matching postal_code_regex (^\d{3}-\d{4}$) | ✅ |
| FK resolution | ✅ |
| state_code ↔ state.iso2 agreement | ✅ |
| No auto-managed fields | ✅ |
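
For reviewers who want to reproduce checks like these locally, here is a minimal sketch. It is not the repo's validator: the record keys (`code`, `country_id`, `state_id`, `state_code`), the `AUTO_MANAGED` field names, and all IDs in the sample data are assumptions made for illustration.

```python
import re

POSTAL_RE = re.compile(r"\d{3}-\d{4}")
# Hypothetical list; the real set of auto-managed fields is not given in this PR.
AUTO_MANAGED = {"id", "created_at", "updated_at"}


def validate(records, states_by_id, country_id):
    """Return a list of (record_index, message) for every failed check."""
    errors = []
    for i, rec in enumerate(records):
        if not POSTAL_RE.fullmatch(str(rec.get("code", ""))):
            errors.append((i, "code does not match postal_code_regex"))
        state = states_by_id.get(rec.get("state_id"))
        if state is None or rec.get("country_id") != country_id:
            errors.append((i, "foreign key does not resolve"))
        elif rec.get("state_code") != state["iso2"]:
            errors.append((i, "state_code disagrees with state.iso2"))
        if AUTO_MANAGED.intersection(rec):
            errors.append((i, "auto-managed field present"))
    return errors


# Made-up IDs, for illustration only.
states_by_id = {10: {"iso2": "01"}}
good = {"code": "060-0000", "country_id": 1, "state_id": 10,
        "state_code": "01", "source": "japan-post"}
bad = {"code": "0600000", "country_id": 1, "state_id": 99,
       "state_code": "01", "id": 5}

errors = validate([good, bad], states_by_id, country_id=1)
for index, message in errors:
    print(index, message)
```

An empty list from `validate` corresponds to the "zero errors" result reported above.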

License & attribution

  • Source: Japan Post (free redistribution permitted, including commercial use; no formal licence text)
  • Each row: source: "japan-post" for provenance

File size

25.7 MB JSON. Larger than India (3.8 MB) or US (7.8 MB) but in-band with existing precedent:

  • contributions/cities/US.json is 22 MB
  • contributions/cities/IT.json is 11 MB

If size becomes a concern, the existing .gz-to-Releases pattern (#1374) extends mechanically to postcodes.
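
What #1374 actually does is not shown in this PR, so the following is only a sketch of the compression half of such a pattern, using the standard library. `gzip_file` is a hypothetical helper and the sample file is generated on the fly; uploading the result to a release is out of scope here.

```python
import gzip
import json
import shutil
import tempfile
from pathlib import Path


def gzip_file(src, dest=None):
    """Write a gzip copy of src next to it (or at dest) and return its path."""
    src = Path(src)
    dest = Path(dest) if dest is not None else src.with_name(src.name + ".gz")
    with src.open("rb") as fin, gzip.open(dest, "wb", compresslevel=9) as fout:
        shutil.copyfileobj(fin, fout)  # streams in chunks, no full read into memory
    return dest


with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "JP.json"
    original = json.dumps([{"code": "060-0000", "source": "japan-post"}] * 1000)
    sample.write_text(original, encoding="utf-8")

    archive = gzip_file(sample)
    archive_name = archive.name
    ratio = archive.stat().st_size / sample.stat().st_size
    restored = gzip.decompress(archive.read_bytes()).decode("utf-8")

print(archive_name, f"{ratio:.3f}")
```

Repetitive JSON like this compresses very well, though the ratio on the synthetic sample says nothing about the ratio for the real 25.7 MB file.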

Cumulative postcode coverage after this lands

| Country | Codes | Source |
| --- | --- | --- |
| JP (this PR) | 120,677 | Japan Post |
| US (#1432) | 33,791 | US Census |
| IN (#1430) | 19,100 | India Post |
| AT (#1431) | 18,722 | OpenPLZ |
| DE (#1431) | 12,815 | OpenPLZ |
| CH (#1431) | 4,059 | OpenPLZ |
| All earlier | ~80 | manual |
| Total | ~209,000 | |

Refs: #1039

…1039)

Largest single-source postcode import yet. Adds:

1. bin/scripts/sync/import_japan_post_postcodes.py — pipeline reading
   Japan Post's KEN_ALL.CSV (Shift-JIS encoded). Picks one canonical
   record per unique 7-digit code (KEN_ALL's natural ordering is
   geographic by JIS municipality code, so the first hit per zip is a
   stable primary). Resolves prefecture by stripping the 都/道/府/県
   suffix from the kanji name before matching against states.native.

2. contributions/postcodes/JP.json — 120,677 codes covering all 47
   prefectures with 100% state_id resolution.

Format normalisation
- KEN_ALL stores 7-digit zips without a separator (0600000); the regex in
  countries.json requires the "###-####" form (^\d{3}-\d{4}$). The pipeline
  inserts the hyphen.

Locality naming
- KEN_ALL splits address into city (col 7) + town (col 8), both kanji
- pipeline builds a "city+town" composite (e.g. "札幌市中央区旭ケ丘") for normal
  rows, and falls back to city-only when town is the catch-all
  placeholder "以下に掲載がない場合" ("if not listed below")
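
A minimal sketch of the naming rule above, assuming the city and town strings have already been read from the CSV. `locality_name` is a hypothetical name, and falling back to city-only for an empty town is an extra assumption not stated in this commit.

```python
# Catch-all town value quoted in this commit message.
CATCH_ALL = "以下に掲載がない場合"


def locality_name(city, town):
    """Return city+town, or city alone when the town is the catch-all placeholder."""
    town = town.strip()
    if not town or town == CATCH_ALL:
        return city  # the empty-town case is an assumption, not from the commit
    return city + town


print(locality_name("札幌市中央区", "旭ケ丘"))
print(locality_name("札幌市中央区", CATCH_ALL))
```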

State resolution (100%)
- 47 prefectures, all matched
- KEN_ALL: 北海道, 青森県, 東京都, 大阪府, etc.
- states.json native: 北海道, 青森, 東京, 大阪, etc. (suffix often missing)
- Suffix-strip lookup: KEN_ALL '青森県' -> '青森' -> matches state native
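
The lookup can be sketched as below. This is an illustration under assumptions: `resolve_state_id` is a hypothetical name, the IDs are made up, and trying the exact match before the stripped form is one possible ordering of the two passes, which this commit does not specify.

```python
SUFFIXES = ("都", "道", "府", "県")


def resolve_state_id(prefecture, native_index):
    """Map a KEN_ALL prefecture name to a state id, or None if unmatched."""
    # Pass 1: exact match, for natives stored with their suffix (e.g. 北海道).
    if prefecture in native_index:
        return native_index[prefecture]
    # Pass 2: drop a trailing 都/道/府/県 and retry (e.g. 青森県 -> 青森).
    if prefecture.endswith(SUFFIXES):
        return native_index.get(prefecture[:-1])
    return None


# Made-up ids; keys mimic how states.json native names are described above.
native_index = {"北海道": 1, "青森": 2, "東京": 3, "大阪": 4}

for name in ("北海道", "青森県", "東京都", "大阪府"):
    print(name, resolve_state_id(name, native_index))
```

Returning None for an unmatched name, rather than raising, makes it easy to count unresolved prefectures and confirm the 47/47 figure.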

Validation (zero errors across 120,677 records)
- All codes match countries.postal_code_regex (^\d{3}-\d{4}$)
- All country_id/state_id foreign keys resolve
- All state_code values agree with state.iso2
- No auto-managed fields present

License & attribution
- Source: Japan Post (free redistribution, including commercial)
- Each row: source: "japan-post"

File size
- 25.7 MB JSON. Larger than India (3.8 MB) or US (7.8 MB) but in-band
  with the repo's existing precedent (cities/US.json is 22 MB,
  cities/IT.json is 11 MB). If size becomes a concern, the existing
  gz-to-Releases pattern (#1374) extends mechanically to postcodes.

Refs: #1039

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 27, 2026 08:36
@dosubot dosubot Bot added the size:XS This PR changes 0-9 lines, ignoring generated files. label Apr 27, 2026

Copilot AI left a comment

Copilot wasn't able to review any files in this pull request.

@github-actions
Contributor

CSC Validation Report

PR Format

  • ✅ Description provided
  • ✅ Data source linked
  • ✅ Issue linked (recommended for data changes)
  • ✅ Justification / context provided

Labels applied: data:postcodes, large-contribution

⚠️ Large Contribution

This PR contains 120677 records. Large contributions require manual review.

Schema Validation (120677 records)

✅ All records passed validation

Cross-Reference Validation

✅ 241354 reference(s) verified


All checks passed | Status: Ready for review

@dosubot dosubot Bot added the enhancement New feature or request label Apr 27, 2026
@dr5hn dr5hn merged commit 0d778c2 into master Apr 27, 2026
1 check passed
@dr5hn dr5hn deleted the feat/postcodes-japan-bulk branch April 27, 2026 08:45
Labels

data:postcodes enhancement New feature or request large-contribution ready-for-review size:XS This PR changes 0-9 lines, ignoring generated files.
