fix(IT): restore corrupted native fields (#1349 follow-up) #1397
Closed
dr5hn wants to merge 1 commit into
Past machine-translation runs polluted the native field for many IT cities, e.g.:
  Pero -> native "Ma"
  Postal -> native "Postale"
  Panchià -> native "Possono agganciare"
  Pareto -> native "Libbra" (Italian for "pound")

The name field already holds the canonical Italian form (Pomigliano d'Arco, Sant'Ambrogio di Torino, etc.), so where the city's name matches an ISTAT comune, native is now copied from name. Cities whose name is not an ISTAT comune (~2,500 frazioni) are left untouched because no authoritative replacement exists.

Counts: 9,947 input -> 6,070 already correct, 2,499 no ISTAT match, 1,378 restored.

Stacks on top of #1395 (the city remap PR).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
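The rule in the commit message can be sketched roughly as follows. This is not the code of italy_restore_native.py, which is not shown here; the function name `restore_native`, the `istat_names` set, and the made-up frazione "Borgata Esempio" are assumptions for illustration only.

```python
from collections import Counter


def restore_native(cities, istat_names):
    """Copy name into native where name is a known ISTAT comune (hypothetical helper).

    Mutates the records in place and returns per-category counts.
    """
    stats = Counter()
    for city in cities:
        name = city["name"]
        if city.get("native") == name:
            stats["already_correct"] += 1   # nothing to do
        elif name in istat_names:
            city["native"] = name           # authoritative replacement exists
            stats["restored"] += 1
        else:
            stats["no_istat_match"] += 1    # e.g. frazioni: left untouched
    return stats


# Toy input; "Borgata Esempio" is an invented non-comune entry.
cities = [
    {"name": "Pero", "native": "Ma"},
    {"name": "Pomigliano d'Arco", "native": "Pomigliano d'Arco"},
    {"name": "Borgata Esempio", "native": "Qualcosa"},
]
istat_names = {"Pero", "Pomigliano d'Arco"}
stats = restore_native(cities, istat_names)
```

Because a restored record then satisfies native == name, a second pass over the same data restores nothing, which is the idempotency property the PR relies on.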
Weekly data-quality review (2026-04-27)
Verdict: clean
Checks
Advisory (non-blocking)
🤖 Automated weekly review: Claude (sonnet-4-6). Generated by Claude Code
dr5hn added a commit that referenced this pull request on Apr 27, 2026
…#1397/#1399) The remap is a behavior change for downstream consumers — region-level state_code queries (e.g. Sicily=82, Lombardy=25) now return empty arrays because cities live under provinces/metropolitan cities, not regions. Documents the traversal pattern (states.parent_id) needed for region-aggregate queries so users know how to migrate.
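A minimal sketch of the traversal pattern the changelog note refers to, assuming each state record carries `id` and `parent_id` and each city carries `state_id`. The function name `cities_in_region` and the toy ids below are invented for illustration and are not taken from the dataset.

```python
from collections import defaultdict


def cities_in_region(region_id, states, cities):
    """Return cities attached to the region itself or to any state beneath it."""
    children = defaultdict(list)
    for state in states:
        if state.get("parent_id") is not None:
            children[state["parent_id"]].append(state["id"])

    # Walk down from the region through parent_id links.
    wanted, stack = set(), [region_id]
    while stack:
        state_id = stack.pop()
        if state_id in wanted:
            continue
        wanted.add(state_id)
        stack.extend(children[state_id])

    return [c for c in cities if c["state_id"] in wanted]


# Toy data with invented ids: a region with two provinces under it.
states = [
    {"id": 1, "name": "Lombardy", "parent_id": None},
    {"id": 2, "name": "Milan", "parent_id": 1},
    {"id": 3, "name": "Pavia", "parent_id": 1},
    {"id": 4, "name": "Sicily", "parent_id": None},
]
cities = [
    {"name": "Pero", "state_id": 2},
    {"name": "Inverno e Monteleone", "state_id": 3},
]
lombardy_cities = cities_in_region(1, states, cities)
```

Filtering directly on the region id would return nothing here, since no city has state_id 1; collecting the child states first is what recovers the region-level aggregate.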
This was referenced Apr 27, 2026
dr5hn added a commit that referenced this pull request on Apr 27, 2026
…p) (#1479)

Dropped the legacy half of each pair, keeping the ISTAT-canonical (or English-name, per repo convention) record:
  id 58976 'Pozzaglio' -> kept 58977 'Pozzaglio ed Uniti'
  id 61329 'Torino' -> kept 61575 'Turin'
  id 61530 'Trinità d'Agultu' -> kept 61531 'Trinità d'Agultu e Vignola'
  id 139215 'Inverno' -> kept 139216 'Inverno e Monteleone'
  id 139523 'Limite' -> kept 136799 'Capraia e Limite'
  id 140714 'Napoli' -> kept 140713 'Naples'

Two pairs are intentionally NOT touched and require maintainer review, since neither record carries the ISTAT-canonical merged name:
- MN: 'Sermide' (id 60744) + 'Felonica' (id 138474)
  canonical comune is 'Sermide e Felonica' (since 2017).
- PV: 'Corteolona' (id 138065) + 'Genzone' (id 138905)
  canonical comune is 'Corteolona e Genzone' (since 2018).

Stacks on top of #1395 and #1397.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
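The removal listed above amounts to filtering the city list by id. A small sketch, assuming records are dicts with an `id` key; the helper name `drop_legacy` is hypothetical and the commit's actual script is not shown here.

```python
# Legacy ids taken from the commit message above.
LEGACY_IDS = {58976, 61329, 61530, 139215, 139523, 140714}


def drop_legacy(cities, legacy_ids=LEGACY_IDS):
    """Return a new list without the legacy half of each duplicate pair."""
    return [c for c in cities if c["id"] not in legacy_ids]


sample = [
    {"id": 61329, "name": "Torino"},   # legacy, dropped
    {"id": 61575, "name": "Turin"},    # kept
    {"id": 60744, "name": "Sermide"},  # untouched, pending maintainer review
]
kept = drop_legacy(sample)
```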
dr5hn added a commit that referenced this pull request on Apr 27, 2026
…n) (#1352 PR-C) (#1392)

* feat(postcodes/DK): bulk-import 1,089 codes via DAWA (#1039)

Adds Danish postcodes via DAWA (Danmarks Adressers Web API), public sector data published under CC-0 by SDFI/Dataforsyningen.

1. bin/scripts/sync/import_denmark_postcodes.py: pipeline that fetches /kommuner to build a kommune-code -> region-name map, then resolves each /postnumre record's region via its first kommune. Maps the 5 Danish region names to states.json iso2 codes:
   Region Hovedstaden -> 84 (called "Denmark" in states.json)
   Region Sjælland -> 85 (Zealand)
   Region Syddanmark -> 83 (Southern Denmark)
   Region Midtjylland -> 82 (Central Denmark)
   Region Nordjylland -> 81 (North Denmark)
2. contributions/postcodes/DK.json: 1,089 codes covering all 5 regions with 100% state_id + 100% coordinate resolution.

Validation (zero errors)
- All codes match countries.postal_code_regex (^(\\d{4})\$)
- All FKs resolve, all state_codes agree with state.iso2

License & attribution
- Source: SDFI / Dataforsyningen DAWA (CC-0)
- Each row: source: "dawa"

Refs: #1039
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(postcodes/IS): bulk-import 195 codes via iceaddr (#1039)

Adds Icelandic postcodes via the sveinbjornt/iceaddr Python package which embeds the canonical postcode metadata under MIT licence.

1. bin/scripts/sync/import_iceland_postcodes.py: pipeline that dynamically imports the iceaddr POSTCODES dict and resolves each code's region via prefix range to states.json iso2 1-8 (Statistics Iceland's NUTS-3 boundaries: 1xx-2xx Capital, 3xx Western, 4xx Westfjords, 5xx Northwestern, 6xx Northeastern, 7xx Eastern, 8xx-9xx Southern).
2. contributions/postcodes/IS.json: 195 records with 100% state_id resolution. Locality names combine stadur_nf + lysing (e.g. "Reykjavík, Miðborg").

License & attribution
- Source: iceaddr (MIT) embedding Pósturinn data
- Each row: source: "iceaddr"

Refs: #1039
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(postcodes/SK+RO+SI): batch-import 15,585 codes via 3 community mirrors (#1039)

Bundles three small-to-medium European countries with confirmed redistributable postcode mirrors into a single batch importer.

1. bin/scripts/sync/import_eu_batch1_postcodes.py: pipeline that ingests three different shapes (SK JSON, RO CSV, SI CSV) and writes per-country JSON files. ASCII-folding + dash-to-space normalisation handles the Romanian Caraș-Severin / Bistrița-Năsăud cases where the CSV uses spaces and states.json uses hyphens.
2. contributions/postcodes/SK.json: 1,312 records (100% state via KRAJ -> states.iso2 direct match)
3. contributions/postcodes/RO.json: 13,751 records (100% state via ASCII-folded judet name match; all 6 Bucharest sectors mapped to 'B')
4. contributions/postcodes/SI.json: 522 records, country-only by design (source has no municipality info; SI postcodes don't map cleanly to administrative regions)

Validation (zero errors)
- All codes match countries.postal_code_regex
- All FKs resolve, all state_codes agree with state.iso2

License & attribution
- SK source: github.com/FeroVolar/PSC-JSON (community Slovenská pošta data)
- RO source: github.com/alexionegit/coduripostaleRomaniaPS
- SI source: github.com/dlabs/postcode_si (community Posta Slovenije data)

Refs: #1039
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(changelog): add notable callout for IT city→province remap (#1395/#1397/#1399)

The remap is a behavior change for downstream consumers: region-level state_code queries (e.g. Sicily=82, Lombardy=25) now return empty arrays because cities live under provinces/metropolitan cities, not regions. Documents the traversal pattern (states.parent_id) needed for region-aggregate queries so users know how to migrate.

* docs: multi-level territories policy (FR overseas, dual representation) (#1352 PR-C)

Adds MULTI_LEVEL_TERRITORIES.md documenting why 12 French overseas territories (and analogous US/CN/NO entities) appear simultaneously as ISO 3166-1 countries and as ISO 3166-2 subdivisions of their parent state. Captures the maintainer's Option C decision on #1352: keep both representations because (1) downstream API/SDK consumers filter on country_code, (2) ISO 3166-1 lists them as countries, and (3) the breaking change is unjustified for a labelling concern.

Cross-links the new policy doc from .claude/CLAUDE.md (Important Rules) and README.md (contributing section). No data changes.

Refs: #1352
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
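The "ASCII-folding + dash-to-space normalisation" mentioned for the Romanian import can be illustrated with the standard-library `unicodedata` module. This is a sketch of the general technique, not the code in import_eu_batch1_postcodes.py; the helper name `fold` and the two-entry states sample are assumptions for illustration.

```python
import unicodedata


def fold(name):
    """Strip diacritics, turn hyphens into spaces, lowercase, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    ascii_only = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(ascii_only.replace("-", " ").lower().split())


# Illustrative sample only: two judet names as they might appear in states.json.
states = [
    {"name": "Caraș-Severin", "iso2": "CS"},
    {"name": "Bistrița-Năsăud", "iso2": "BN"},
]
by_folded = {fold(s["name"]): s["iso2"] for s in states}

# A CSV-style spelling with spaces and no diacritics resolves to the same key.
code = by_folded.get(fold("Caras Severin"))
```

Folding both sides to the same key means the lookup does not depend on which source uses hyphens or diacritics.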
Stacks on top of #1395. Targets `feat/issue-1349-italy-city-remap` so the diff stays focused; rebase to master once #1395 lands.

Refs #1349: discovered while writing the city remap.

Many `native` values in `contributions/cities/IT.json` were corrupted by a past machine-translation run, e.g.:
- Pero -> native "Ma"
- Postal -> native "Postale"
- Panchià -> native "Possono agganciare"
- Pareto -> native "Libbra" (Italian for "pound")

The `name` field already holds the canonical Italian form for these comuni (Pomigliano d'Arco, Sant'Ambrogio di Torino, etc., all with proper apostrophes and accents). Where a city's `name` matches an ISTAT comune, this PR sets `native = name`. Cities whose `name` is not an ISTAT comune (≈2,500 frazioni) are intentionally left untouched because no authoritative replacement is available.

Counts
- 9,947 input
- 6,070 already correct
- 2,499 no ISTAT match
- 1,378 restored

Commits
- `fix(IT): restore 1,378 corrupted native fields (#1349 follow-up)`: script `italy_restore_native.py` + bundled report + IT.json data diff.

Test plan
- `python3 bin/scripts/fixes/italy_restore_native.py --dry-run` reports 0 changes after this PR (idempotent).
- `jq '[.[] | select(.name == .native)] | length' contributions/cities/IT.json` returns ≥ 7,448 (was 6,070).
- `python3 -m json.tool contributions/cities/IT.json` parses cleanly.
- `validate-schema` / `validate-cross-reference` / `validate-coordinates` pass.

🤖 Generated with Claude Code
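For reviewers without jq, the second and third test-plan checks can be approximated in Python. This is a sketch, assuming IT.json is a JSON array of objects with `name` and `native` keys; the function names are hypothetical, and the 7,448 threshold is the figure from the test plan (6,070 + 1,378).

```python
import json


def count_name_equals_native(cities):
    """Count records whose native equals name (mirrors the jq select)."""
    return sum(1 for c in cities if c.get("name") == c.get("native"))


def check_file(path, minimum=7448):
    """Parse the file (raises on invalid JSON) and compare the count to the threshold."""
    with open(path, encoding="utf-8") as fh:
        cities = json.load(fh)
    matched = count_name_equals_native(cities)
    return matched >= minimum, matched


# Toy records to show the counting rule without touching the real file.
sample = [
    {"name": "Pero", "native": "Pero"},
    {"name": "Pareto", "native": "Pareto"},
    {"name": "Panchià", "native": "Possono agganciare"},
]
matched = count_name_equals_native(sample)
```

From the repository root, calling `check_file("contributions/cities/IT.json")` would return a pass/fail flag together with the observed count.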