
feat(postcodes/TW): 371 Chunghwa Post codes (#1039) #1501

Merged
dr5hn merged 1 commit into master from feat/postcodes-taiwan
May 5, 2026

Conversation

@dr5hn
Owner

@dr5hn dr5hn commented May 2, 2026

Summary

  • Imports Taiwan's 3-digit Chunghwa Post postal-area codes (371 rows) for "Can we add a postcode for this?" #1039
  • 100% state FK resolution across all 22 CSC TW states
  • Fixes TW regex ^\d{5}$ → ^\d{3}(\d{2,3})?$ to accept the three Taiwan postcode generations

Source

Regex fix

Taiwan has used three postcode generations:

| Era | Form | Example |
| --- | --- | --- |
| pre-2020 | 3-digit area code | 100 |
| intermediate | 3+2 = 5-digit | 10001 |
| 2020+ (Chunghwa current) | 3+3 = 6-digit | 100002 |

This dataset is the 3-digit form. The old regex ^\d{5}$ accepted only the 5-digit form, so it rejected every code in this dataset as well as the current 6-digit form.
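A quick way to see the difference is to run one sample code per generation through both patterns. This is a minimal sketch using the two regexes quoted in this PR; the sample codes are the ones from the table, and the helper name `accepted` is made up for illustration.

```python
import re

OLD_TW = re.compile(r"^\d{5}$")            # pattern before this PR
NEW_TW = re.compile(r"^\d{3}(\d{2,3})?$")  # pattern after this PR

# One example per postcode generation, taken from the table above.
samples = {"3-digit": "100", "3+2": "10001", "3+3": "100002"}

def accepted(pattern, codes):
    """Return the names of the forms that the pattern accepts."""
    return [name for name, code in codes.items() if pattern.fullmatch(code)]

old_ok = accepted(OLD_TW, samples)
new_ok = accepted(NEW_TW, samples)

print(old_ok)  # ['3+2']
print(new_ok)  # ['3-digit', '3+2', '3+3']
```

The new pattern still rejects 4-digit and 7-digit strings, because the optional group only allows two or three extra digits.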

Edge cases handled

| Source label | Mapped to | Reason |
| --- | --- | --- |
| 'Taoyuan City City' | TAO | source typo |
| 'Diaoyutai' | ILA (Yilan) | Senkaku/Diaoyu; ROC administers under Yilan County |
| 'Dongsha Islands, Nanhai Islands' | KHH (Kaohsiung) | Pratas; administered by Kaohsiung |
| 'Nansha Islands, Nanhai Islands' | KHH (Kaohsiung) | Spratlys; administered by Kaohsiung |

Distribution (top 5)

| iso2 | division | rows |
| --- | --- | --- |
| KHH | Kaohsiung | 40 |
| TNN | Tainan | 37 |
| PIF | Pingtung County | 33 |
| NWT | New Taipei | 29 |
| TXG | Taichung | 29 |

All 22 CSC TW states covered.

Test plan

  • python3 -m py_compile bin/scripts/sync/import_taiwan_postcodes.py
  • All 371 codes match ^\d{3}(\d{2,3})?$
  • 100% state_id valid; state.country_id == 216; state_code == state.iso2
  • No auto-managed fields (id, created_at, updated_at, flag)
  • Idempotent merge (re-run produces no diff)
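
The idempotency check in the test plan can be illustrated with a keyed upsert: merging the same input twice must give the same result as merging it once. This is only a sketch; the `merge` helper, the `(code, state_code)` key, and the sample rows are assumptions for illustration, not the importer's actual code.

```python
def merge(existing, incoming, key=("code", "state_code")):
    """Upsert incoming rows into existing rows by key; sort for stable output."""
    merged = {tuple(row[k] for k in key): row for row in existing}
    for row in incoming:
        merged[tuple(row[k] for k in key)] = row
    return [merged[k] for k in sorted(merged)]

# Hypothetical rows; the field names are assumed, not taken from TW.json.
incoming = [
    {"code": "800", "state_code": "KHH"},
    {"code": "100", "state_code": "TPE"},
]

first = merge([], incoming)
second = merge(first, incoming)  # re-run against the previous output

print(first == second)  # True
```

Sorting the output matters as much as the keyed upsert: without a stable order, a re-run could reorder rows and produce a diff even though no record changed.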

🤖 Generated with Claude Code

Adds Taiwan's 3-digit postal-area codes from the eagle-tw-open-data
mirror (Chunghwa Post / Ministry of Interior open data).

Why
---
Closes the TW gap on issue #1039. The 3-digit area codes are the
historic Chunghwa Post format and remain the most-cited form across
Taiwan government open data.

Coverage
--------
- 371 codes / 100% state FK resolution
- All 22 CSC TW states covered

State FK strategy
-----------------
Source has Chinese (city+district) and English (district + city/county)
labels. Importer parses the trailing 'X City' / 'X County' from the
English column and resolves via 22-entry ENGLISH_TO_ISO2.
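
A minimal sketch of that strategy follows. The label layout ("district, city/county") and the mapping keys are assumptions, and the dictionary is deliberately partial: it lists only the codes named in this PR, whereas the real ENGLISH_TO_ISO2 has 22 entries.

```python
import re

# Partial, illustrative mapping. The key format is assumed.
ENGLISH_TO_ISO2 = {
    "Kaohsiung City": "KHH",
    "Tainan City": "TNN",
    "Pingtung County": "PIF",
    "New Taipei City": "NWT",
    "Taichung City": "TXG",
    "Taoyuan City": "TAO",
    "Yilan County": "ILA",
}

# Capture the last comma-separated segment if it ends in "City" or "County".
TRAILING_DIVISION = re.compile(r"(?:^|,\s*)([^,]+?\s(?:City|County))\s*$")

def resolve_state(english_label):
    """Return the iso2 code for the trailing city/county, or None if unresolved."""
    match = TRAILING_DIVISION.search(english_label)
    if match is None:
        return None
    return ENGLISH_TO_ISO2.get(match.group(1).strip())

print(resolve_state("Sanmin Dist., Kaohsiung City"))    # KHH
print(resolve_state("Pingtung City, Pingtung County"))  # PIF
print(resolve_state("Diaoyutai"))                       # None
```

Anchoring on the trailing segment avoids misreading district names that themselves end in "City", as in the second example.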

Edge cases
----------
- Source typo 'Taoyuan City City' (1 row) -> TAO
- Disputed islands without CSC iso2 entry mapped via SPECIAL_LABEL_TO_ISO2:
  'Diaoyutai' (Senkaku/Diaoyu)        -> ILA (ROC administers under Yilan)
  'Dongsha Islands, Nanhai Islands'   -> KHH (Pratas, under Kaohsiung)
  'Nansha Islands, Nanhai Islands'    -> KHH (Spratlys, under Kaohsiung)
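
These edge cases could be handled with a cleanup step plus a lookup table consulted before the regular resolver. The sketch below uses the three SPECIAL_LABEL_TO_ISO2 values listed above; the helper names, the lookup order, and the one-entry fallback table are assumptions for illustration.

```python
import re

# Values from the list above.
SPECIAL_LABEL_TO_ISO2 = {
    "Diaoyutai": "ILA",
    "Dongsha Islands, Nanhai Islands": "KHH",
    "Nansha Islands, Nanhai Islands": "KHH",
}

def collapse_repeated_suffix(label):
    """Turn a trailing 'City City' (or 'County County') into a single word."""
    return re.sub(r"\b(City|County)(?:\s+\1)+\s*$", r"\1", label.strip())

def resolve_special(label, fallback):
    """Check the special-case table first, then defer to the fallback resolver."""
    cleaned = collapse_repeated_suffix(label)
    if cleaned in SPECIAL_LABEL_TO_ISO2:
        return SPECIAL_LABEL_TO_ISO2[cleaned]
    return fallback(cleaned)

# Stand-in for the regular city/county resolver (hypothetical, one entry).
regular = {"Taoyuan City": "TAO"}.get

print(resolve_special("Taoyuan City City", regular))               # TAO
print(resolve_special("Diaoyutai", regular))                       # ILA
print(resolve_special("Dongsha Islands, Nanhai Islands", regular)) # KHH
```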

Encoding
--------
Source ships **CP950 / Big5**; reading it as UTF-8 produces mojibake.
Importer explicitly decodes as cp950.
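
The effect of the codec choice can be reproduced without the source file. This sketch encodes a sample label (chosen for illustration, not taken from the dataset) as CP950 and decodes the bytes both ways.

```python
text = "臺北市中正區"        # sample label, not from the source file
raw = text.encode("cp950")  # bytes as a Big5/CP950 file would store them

as_utf8 = raw.decode("utf-8", errors="replace")  # wrong codec
as_cp950 = raw.decode("cp950")                   # codec the importer uses

print(as_utf8 == text)   # False
print(as_cp950 == text)  # True
```

When reading from disk, the same choice is expressed as `open(path, encoding="cp950")`. Without `errors="replace"`, the UTF-8 decode of these bytes would raise `UnicodeDecodeError` instead of returning damaged text.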

Regex fix
---------
Before this PR, countries.json had TW regex `^\d{5}$` (5-digit) which
never matched the dataset's 3-digit codes (or Chunghwa's modern 6-digit
3+3 codes). Updated to `^\d{3}(\d{2,3})?$` to accept all three
generations: 3-digit (this dataset), 5-digit (intermediate), 6-digit
(2020+ canonical).

License
-------
GPL-3.0 (unusual for data; redistribution permitted with attribution;
flagged here per #1039 license-tier policy). Each row carries
`source: "chunghwa-post-via-eagle-tw-open-data"`.

Validation
----------
- python3 -m py_compile passes
- 100% regex match (^\d{3}(\d{2,3})?$)
- 100% state_id valid + state.country_id == 216 + state_code agrees
- No auto-managed fields (id, created_at, updated_at, flag)
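
The checks above could be scripted roughly as follows. This is a sketch under assumptions: the row field names (`code`, `state_id`, `state_code`), the shape of `states_by_id`, and the sample state id are hypothetical; only the regex, the four auto-managed field names, and country_id 216 come from this PR.

```python
import re

TW_REGEX = re.compile(r"^\d{3}(\d{2,3})?$")
AUTO_MANAGED = {"id", "created_at", "updated_at", "flag"}
TW_COUNTRY_ID = 216

def validate(rows, states_by_id):
    """Return (row_index, problem) tuples; an empty list means all checks pass."""
    problems = []
    for i, row in enumerate(rows):
        if not TW_REGEX.fullmatch(row["code"]):
            problems.append((i, "regex"))
        leaked = AUTO_MANAGED.intersection(row)
        if leaked:
            problems.append((i, "auto-managed: " + ", ".join(sorted(leaked))))
        state = states_by_id.get(row["state_id"])
        if state is None:
            problems.append((i, "unknown state_id"))
        elif state["country_id"] != TW_COUNTRY_ID:
            problems.append((i, "state not in TW"))
        elif state["iso2"] != row["state_code"]:
            problems.append((i, "state_code mismatch"))
    return problems

# Hypothetical lookup table and rows.
states_by_id = {1: {"iso2": "KHH", "country_id": 216}}
good = [{"code": "800", "state_id": 1, "state_code": "KHH"}]
bad = [{"code": "8000", "state_id": 1, "state_code": "TNN", "id": 5}]

print(validate(good, states_by_id))  # []
print(validate(bad, states_by_id))
```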

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@dosubot dosubot Bot added the size:XS label May 2, 2026
@dosubot dosubot Bot added the enhancement label May 2, 2026
@github-actions
Contributor

github-actions Bot commented May 2, 2026

CSC Validation Report

PR Format

  • ✅ Description provided
  • ✅ Data source linked
  • ✅ Issue linked (recommended for data changes)
  • ✅ Justification / context provided

Labels applied: data:countries, data:postcodes, large-contribution

⚠️ Large Contribution

This PR contains 621 records. Large contributions require manual review.

Schema Validation (621 records)

Errors (blocking):

  • ❌ contributions/countries/countries.json: Record 1 ("Afghanistan"): "id" must not be included (auto-managed)
  • ❌ contributions/countries/countries.json: Record 1 ("Afghanistan"): "created_at" must not be included (auto-managed)
  • ❌ contributions/countries/countries.json: Record 1 ("Afghanistan"): "updated_at" must not be included (auto-managed)
  • ❌ contributions/countries/countries.json: Record 1 ("Afghanistan"): "flag" must not be included (auto-managed)
  • ❌ contributions/countries/countries.json: Record 2 ("Aland Islands"): "id" must not be included (auto-managed)
  • ❌ contributions/countries/countries.json: Record 2 ("Aland Islands"): "created_at" must not be included (auto-managed)
  • ❌ contributions/countries/countries.json: Record 2 ("Aland Islands"): "updated_at" must not be included (auto-managed)
  • ❌ contributions/countries/countries.json: Record 2 ("Aland Islands"): "flag" must not be included (auto-managed)
  • ❌ contributions/countries/countries.json: Record 3 ("Albania"): "id" must not be included (auto-managed)
  • ❌ contributions/countries/countries.json: Record 3 ("Albania"): "created_at" must not be included (auto-managed)
  • ❌ contributions/countries/countries.json: Record 3 ("Albania"): "updated_at" must not be included (auto-managed)
  • ❌ contributions/countries/countries.json: Record 3 ("Albania"): "flag" must not be included (auto-managed)
  • ❌ contributions/countries/countries.json: Record 4 ("Algeria"): "id" must not be included (auto-managed)
  • ❌ contributions/countries/countries.json: Record 4 ("Algeria"): "created_at" must not be included (auto-managed)
  • ❌ contributions/countries/countries.json: Record 4 ("Algeria"): "updated_at" must not be included (auto-managed)
  • ❌ contributions/countries/countries.json: Record 4 ("Algeria"): "flag" must not be included (auto-managed)
  • ❌ contributions/countries/countries.json: Record 5 ("American Samoa"): "id" must not be included (auto-managed)
  • ❌ contributions/countries/countries.json: Record 5 ("American Samoa"): "created_at" must not be included (auto-managed)
  • ❌ contributions/countries/countries.json: Record 5 ("American Samoa"): "updated_at" must not be included (auto-managed)
  • ❌ contributions/countries/countries.json: Record 5 ("American Samoa"): "flag" must not be included (auto-managed)
  • ...and 980 more errors

Warnings:

  • ⚠️ contributions/countries/countries.json: Record 1 ("Afghanistan"): unknown field "population"
  • ⚠️ contributions/countries/countries.json: Record 1 ("Afghanistan"): unknown field "gdp"
  • ⚠️ contributions/countries/countries.json: Record 1 ("Afghanistan"): unknown field "area_sq_km"
  • ⚠️ contributions/countries/countries.json: Record 1 ("Afghanistan"): unknown field "postal_code_format"
  • ⚠️ contributions/countries/countries.json: Record 1 ("Afghanistan"): unknown field "postal_code_regex"
  • ⚠️ contributions/countries/countries.json: Record 2 ("Aland Islands"): unknown field "population"
  • ⚠️ contributions/countries/countries.json: Record 2 ("Aland Islands"): unknown field "gdp"
  • ⚠️ contributions/countries/countries.json: Record 2 ("Aland Islands"): unknown field "area_sq_km"
  • ⚠️ contributions/countries/countries.json: Record 2 ("Aland Islands"): unknown field "postal_code_format"
  • ⚠️ contributions/countries/countries.json: Record 2 ("Aland Islands"): unknown field "postal_code_regex"
  • ...and 1240 more warnings

Cross-Reference Validation

✅ 742 reference(s) verified

Duplicate Detection

  • ⚠️ contributions/countries/countries.json: Record 1 ("Afghanistan") appears to be a duplicate of existing "Afghanistan" (id: 1, distance: 0.0km)
  • ⚠️ contributions/countries/countries.json: Record 2 ("Aland Islands") appears to be a duplicate of existing "Aland Islands" (id: 2, distance: 0.0km)
  • ⚠️ contributions/countries/countries.json: Record 3 ("Albania") appears to be a duplicate of existing "Albania" (id: 3, distance: 0.0km)
  • ⚠️ contributions/countries/countries.json: Record 4 ("Algeria") appears to be a duplicate of existing "Algeria" (id: 4, distance: 0.0km)
  • ⚠️ contributions/countries/countries.json: Record 5 ("American Samoa") appears to be a duplicate of existing "American Samoa" (id: 5, distance: 0.0km)
  • ⚠️ contributions/countries/countries.json: Record 6 ("Andorra") appears to be a duplicate of existing "Andorra" (id: 6, distance: 0.0km)
  • ⚠️ contributions/countries/countries.json: Record 7 ("Angola") appears to be a duplicate of existing "Angola" (id: 7, distance: 0.0km)
  • ⚠️ contributions/countries/countries.json: Record 8 ("Anguilla") appears to be a duplicate of existing "Anguilla" (id: 8, distance: 0.0km)
  • ⚠️ contributions/countries/countries.json: Record 9 ("Antarctica") appears to be a duplicate of existing "Antarctica" (id: 9, distance: 0.0km)
  • ⚠️ contributions/countries/countries.json: Record 10 ("Antigua and Barbuda") appears to be a duplicate of existing "Antigua and Barbuda" (id: 10, distance: 0.0km)

Source URL Verification

✅ 2 source URL(s) accessible

❌ 1000 error(s), 1500 warning(s) | Status: Changes required

Please fix the errors above and push a new commit. Refer to our Contribution Guidelines for details.

Owner Author

dr5hn commented May 4, 2026

Weekly data-quality review (2026-05-04)

Verdict: needs-discussion

Checks

  • Schema: ✅ Change to contributions/countries/countries.json updates TW postal_code_format and postal_code_regex. No forbidden fields introduced. New contributions/postcodes/TW.json records correctly omit id, flag, created_at, updated_at. (CI errors are false positives; see note below.)
  • FK integrity: ✅ All 371 TW postcode records resolve to valid TW states (per CI cross-reference pass).
  • Coordinates: ✅ No out-of-bounds coordinates reported by CI.
  • Wikidata: N/A.
  • Naming convention: N/A.

Discussion item

postal_code_format value: the PR sets this to "###/#####/######". Taiwan postcodes don't include a literal / separator; the three forms (100, 10001, 100002) are just numeric strings of different lengths. Using / to delimit alternatives in a format string is non-standard: other entries use # for a digit, @ for a letter, ? for an alphanumeric character, and literal characters only for separators that actually appear in the code. Consider either:

  • Three separate values separated by a conventional indicator, e.g. "###|#####|######" if the repo adopts a pipe notation for alternatives, or
  • Just the shortest form "###" with a note in the PR, since the regex covers all three generations.

The regex itself (^(\d{3}(?:\d{2,3})?)$) is correct and accepts all three generations.
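
To make the notation question concrete, the sketch below expands a format string into a regex and compares it with the merged pattern. The `|` alternative syntax is only the proposal from this comment, not an adopted convention, and the `format_to_regex` helper and the character classes chosen for `@` and `?` are assumptions.

```python
import re

# Placeholder meanings as described in this comment; the classes are assumed.
PLACEHOLDERS = {"#": r"\d", "@": r"[A-Za-z]", "?": r"[A-Za-z0-9]"}

def format_to_regex(fmt):
    """Convert a '|'-separated format string into an anchored regex string."""
    alternatives = [
        "".join(PLACEHOLDERS.get(ch, re.escape(ch)) for ch in alt)
        for alt in fmt.split("|")
    ]
    return "^(?:" + "|".join(alternatives) + ")$"

piped = re.compile(format_to_regex("###|#####|######"))
slashed = re.compile(format_to_regex("###/#####/######"))
merged = re.compile(r"^(\d{3}(?:\d{2,3})?)$")

# Compare the pipe form with the merged regex on digit strings of length 1 to 7.
agree = all(
    bool(piped.fullmatch("1" * n)) == bool(merged.fullmatch("1" * n))
    for n in range(1, 8)
)

print(agree)                            # True
print(slashed.fullmatch("100"))         # None
print(bool(slashed.fullmatch("100/10001/100002")))  # True
```

Read literally, the slash form describes a single code containing two slashes, which is why it cannot stand in for the three alternatives.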

Note on CI "needs-changes" label

Same false-positive issue as the other postcode PRs: validator checks all 250 existing countries in countries.json for forbidden fields that legitimately exist on pre-existing records.
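
One way a validator could avoid this class of false positive is to flag only the auto-managed fields that a PR introduces relative to the base branch. This is a hypothetical sketch, not the repository's validator; the function name, the `iso2` key, and the toy records are assumptions.

```python
AUTO_MANAGED = {"id", "created_at", "updated_at", "flag"}

def introduced_forbidden_fields(base, head, key="iso2"):
    """List (key, field) pairs for auto-managed fields that head adds over base."""
    base_by_key = {record[key]: record for record in base}
    findings = []
    for record in head:
        previous = base_by_key.get(record[key], {})
        added = (set(record) - set(previous)) & AUTO_MANAGED
        findings.extend((record[key], field) for field in sorted(added))
    return findings

# Toy data: the existing TW record already carries auto-managed fields.
base = [{"iso2": "TW", "id": 216, "flag": 1, "postal_code_regex": r"^\d{5}$"}]
head = [{"iso2": "TW", "id": 216, "flag": 1, "postal_code_regex": r"^\d{3}(\d{2,3})?$"}]
brand_new = head + [{"iso2": "XX", "id": 999}]

print(introduced_forbidden_fields(base, head))       # []
print(introduced_forbidden_fields(base, brand_new))  # [('XX', 'id')]
```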

🤖 Automated weekly review, Claude (sonnet-4-6).


Generated by Claude Code

@dr5hn
Owner Author

dr5hn commented May 5, 2026

Merging as-is. The postal_code_format value "###/#####/######" is non-standard (TW codes don't include a literal /). Will normalise the convention across all multi-format-tier countries in a follow-up PR, using either pipe notation ###|#####|###### or just the shortest form ###. The regex ^(\d{3}(?:\d{2,3})?)$ correctly accepts all three generations either way.

@dr5hn dr5hn merged commit bf4fef2 into master May 5, 2026
1 check passed
@dr5hn dr5hn deleted the feat/postcodes-taiwan branch May 5, 2026 11:11

Labels

data:countries, data:postcodes, enhancement, large-contribution, needs-changes, size:XS


1 participant