fix(IT): restore corrupted native fields (#1349 follow-up) #1397

Closed
dr5hn wants to merge 1 commit into feat/issue-1349-italy-city-remap from fix/issue-1349-italy-native-field

dr5hn (Owner) commented Apr 25, 2026

Stacks on top of #1395. Targets feat/issue-1349-italy-city-remap so the diff stays focused; rebase to master once #1395 lands.

Refs #1349 — discovered while writing the city remap. Many native values in contributions/cities/IT.json were corrupted by a past machine-translation run, e.g.:

name          corrupted native        meaning
Pero          "Ma"                    "but"
Postal        "Postale"               adjective form
Pareto        "Libbra"                "pound" (weight)
Paroldo       "Discorso"              "speech"
Palata        "Ritorno"               "return"
Papozze       "Flutter"               English word
Panchià       "Possono agganciare"    "they can hook up"
Petit Fenis   "Chiede un fieno"       "asks for hay"
Pollein       "Poolin"                garbled
Paspardo      "Strakes"               English nautical term

The name field already holds the canonical Italian form for these comuni (Pomigliano d'Arco, Sant'Ambrogio di Torino, etc., all with proper apostrophes and accents). Where a city's name matches an ISTAT comune, this PR sets native = name. Cities whose name is not an ISTAT comune (≈2,500 frazioni) are intentionally left untouched — no authoritative replacement is available.

Counts

input              9947
already correct    6070   native already == name
no ISTAT match     2499   left untouched (frazioni)
restored           1378   native rewritten to name
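
The rule and the three count buckets above can be sketched as follows. This is a minimal illustration, not the contents of italy_restore_native.py: the function name restore_native, the istat_names set, the order of the checks, and the sample records are assumptions. Only the name and native fields come from this PR.

```python
from collections import Counter


def restore_native(cities, istat_names):
    """Copy name into native where name is an ISTAT comune (hypothetical helper)."""
    counts = Counter()
    for city in cities:
        counts["input"] += 1
        if city.get("native") == city["name"]:
            counts["already_correct"] += 1
        elif city["name"] not in istat_names:
            # No authoritative replacement available: leave the record untouched.
            counts["no_istat_match"] += 1
        else:
            city["native"] = city["name"]
            counts["restored"] += 1
    return counts


# Illustrative records; "Frazione Esempio" is a made-up non-ISTAT name.
sample_cities = [
    {"name": "Pero", "native": "Ma"},
    {"name": "Papozze", "native": "Papozze"},
    {"name": "Frazione Esempio", "native": "Esempio"},
]
sample_istat = {"Pero", "Papozze"}

first_run = restore_native(sample_cities, sample_istat)
second_run = restore_native(sample_cities, sample_istat)
```

Under these assumptions a second pass restores nothing, which is the idempotency property the test plan checks with --dry-run.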

Commits

  1. fix(IT): restore 1,378 corrupted native fields (#1349 follow-up) — script italy_restore_native.py + bundled report + IT.json data diff.

Test plan

  • python3 bin/scripts/fixes/italy_restore_native.py --dry-run reports 0 changes after this PR (idempotent).
  • jq '[.[] | select(.name == .native)] | length' contributions/cities/IT.json returns ≥ 7,448 (was 6,070).
  • python3 -m json.tool contributions/cities/IT.json parses cleanly.
  • CI's validate-schema / validate-cross-reference / validate-coordinates pass.

🤖 Generated with Claude Code

Past machine-translation runs polluted the native field for many IT
cities (e.g. Pero -> native "Ma", Postal -> native "Postale",
Panchià -> native "Possono agganciare", Pareto -> native "Libbra"
which is "pound"). The name field already holds the canonical Italian
form (Pomigliano d'Arco, Sant'Ambrogio di Torino, etc.), so where the
city's name matches an ISTAT comune, native is now copied from name.

Cities whose name is not an ISTAT comune (~2,500 frazioni) are left
untouched — no authoritative replacement exists.

Counts: 9947 input -> 6070 already correct, 2499 no ISTAT match, 1378
restored.

Stacks on top of #1395 (the city remap PR).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
dr5hn (Owner, Author) commented Apr 27, 2026

Weekly data-quality review (2026-04-27)

Verdict: clean

Checks

  • Schema: ✅ Only the native field modified on existing records; no new records; no auto-managed fields touched.
  • FK integrity: ✅ No FK fields changed.
  • Coordinates: ✅ No coordinate changes.
  • Wikidata: N/A (no Wikidata field changes)
  • Naming convention: ✅ Restores clearly corrupted machine-translated native values to the authoritative ISTAT Italian commune name. Spot-checked examples confirmed: "Ma" → "Pagnona", "Flutter" → "Papozze", "Discorso" → "Paroldo", "Ritorno" → "Palata", "Strakes" → "Paspardo". The name field (English/canonical) is untouched throughout.

Advisory (non-blocking)

  • Bilingual municipalities (Alto Adige/Südtirol) — For ~390 German-speaking communes in Trentino-Alto Adige, setting native = name (Italian ISTAT denomination) does not capture the co-official German name (e.g., "Bolzano" vs "Bozen"). This is a pre-existing policy question not introduced by this PR — Italian ISTAT names are still an improvement over garbage values. A follow-up could store the bilingual form (e.g., "Bolzano/Bozen") for these communes, if the repo adopts that convention.
  • 2,499 frazioni untouched — Records without an ISTAT match are intentionally skipped; their native fields may still be stale. Acceptable scope limitation for this PR.

🤖 Automated weekly review — Claude (sonnet-4-6).


Generated by Claude Code

dr5hn added a commit that referenced this pull request Apr 27, 2026
…#1397/#1399)

The remap is a behavior change for downstream consumers — region-level
state_code queries (e.g. Sicily=82, Lombardy=25) now return empty arrays
because cities live under provinces/metropolitan cities, not regions.
Documents the traversal pattern (states.parent_id) needed for
region-aggregate queries so users know how to migrate.
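
A region-aggregate lookup along these lines might look like the sketch below. Only states.parent_id is named in the commit message; the id and state_id field names, the function name cities_in_region, and the sample rows are assumptions for illustration.

```python
def cities_in_region(region_id, states, cities):
    """Collect cities whose state is the region itself or any descendant of it."""
    # Index child state ids by their parent_id.
    children = {}
    for state in states:
        children.setdefault(state.get("parent_id"), []).append(state["id"])

    # Walk the parent_id tree downward from the region.
    wanted, stack = set(), [region_id]
    while stack:
        current = stack.pop()
        if current in wanted:
            continue
        wanted.add(current)
        stack.extend(children.get(current, []))

    return [city for city in cities if city["state_id"] in wanted]


# Hypothetical sample rows, not taken from the dataset.
sample_states = [
    {"id": 1, "name": "Region A", "parent_id": None},
    {"id": 2, "name": "Province A1", "parent_id": 1},
    {"id": 3, "name": "Province A2", "parent_id": 1},
    {"id": 4, "name": "Region B", "parent_id": None},
]
sample_cities = [
    {"name": "City X", "state_id": 2},
    {"name": "City Y", "state_id": 3},
    {"name": "City Z", "state_id": 4},
]
region_a_cities = [c["name"] for c in cities_in_region(1, sample_states, sample_cities)]
```

Including the region id itself in the wanted set keeps the helper usable for datasets where some cities are still attached directly to a region.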
dr5hn deleted the branch feat/issue-1349-italy-city-remap April 27, 2026 14:44
dr5hn closed this Apr 27, 2026
dr5hn added a commit that referenced this pull request Apr 27, 2026
…p) (#1479)

Dropped the legacy half of each pair, keeping the ISTAT-canonical (or
English-name, per repo convention) record:

  id 58976  'Pozzaglio'              -> kept 58977  'Pozzaglio ed Uniti'
  id 61329  'Torino'                 -> kept 61575  'Turin'
  id 61530  'Trinità d\'Agultu'      -> kept 61531  'Trinità d\'Agultu e Vignola'
  id 139215 'Inverno'                -> kept 139216 'Inverno e Monteleone'
  id 139523 'Limite'                 -> kept 136799 'Capraia e Limite'
  id 140714 'Napoli'                 -> kept 140713 'Naples'

Two pairs are intentionally NOT touched and require maintainer review,
since neither record carries the ISTAT-canonical merged name:
  - MN: 'Sermide' (id 60744) + 'Felonica' (id 138474)
        canonical comune is 'Sermide e Felonica' (since 2017).
  - PV: 'Corteolona' (id 138065) + 'Genzone' (id 138905)
        canonical comune is 'Corteolona e Genzone' (since 2018).

Stacks on top of #1395 and #1397.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
dr5hn added a commit that referenced this pull request Apr 27, 2026
…n) (#1352 PR-C) (#1392)

* feat(postcodes/DK): bulk-import 1,089 codes via DAWA (#1039)

Adds Danish postcodes via DAWA (Danmarks Adressers Web API) — public
sector data published under CC-0 by SDFI/Dataforsyningen.

1. bin/scripts/sync/import_denmark_postcodes.py — pipeline that fetches
   /kommuner to build a kommune-code -> region-name map, then resolves
   each /postnumre record's region via its first kommune. Maps the 5
   Danish region names to states.json iso2 codes:
     Region Hovedstaden -> 84 (called "Denmark" in states.json)
     Region Sjælland    -> 85 (Zealand)
     Region Syddanmark  -> 83 (Southern Denmark)
     Region Midtjylland -> 82 (Central Denmark)
     Region Nordjylland -> 81 (North Denmark)

2. contributions/postcodes/DK.json — 1,089 codes covering all 5 regions
   with 100% state_id + 100% coordinate resolution.
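
The region resolution in step 1 can be sketched roughly as below. The region-to-iso2 table is taken from the list above; the record shapes (kode, region, kommuner, nr) are simplified assumptions rather than the actual DAWA response format, and the function names are hypothetical.

```python
# Region name -> states.json iso2, as listed in this commit message.
REGION_TO_ISO2 = {
    "Region Hovedstaden": "84",
    "Region Sjælland": "85",
    "Region Syddanmark": "83",
    "Region Midtjylland": "82",
    "Region Nordjylland": "81",
}


def build_kommune_region_map(kommuner):
    """Map kommune code -> region name (record shape is assumed)."""
    return {k["kode"]: k["region"] for k in kommuner}


def resolve_state_code(postnummer, kommune_to_region):
    """Resolve a postcode's iso2 via its first kommune, or None if unresolved."""
    kommune_codes = postnummer.get("kommuner") or []
    if not kommune_codes:
        return None
    region = kommune_to_region.get(kommune_codes[0])
    return REGION_TO_ISO2.get(region)


# Simplified illustrative records, not real API payloads.
sample_kommuner = [
    {"kode": "0101", "region": "Region Hovedstaden"},
    {"kode": "0851", "region": "Region Nordjylland"},
]
kommune_to_region = build_kommune_region_map(sample_kommuner)

code_a = resolve_state_code({"nr": "1050", "kommuner": ["0101"]}, kommune_to_region)
code_b = resolve_state_code({"nr": "9000", "kommuner": ["0851"]}, kommune_to_region)
code_c = resolve_state_code({"nr": "0000", "kommuner": []}, kommune_to_region)
```

Returning None for unresolved codes, instead of raising, would let a caller count and report misses before writing the output file.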

Validation (zero errors)
- All codes match countries.postal_code_regex (^(\d{4})$)
- All FKs resolve, all state_codes agree with state.iso2

License & attribution
- Source: SDFI / Dataforsyningen DAWA (CC-0)
- Each row: source: "dawa"

Refs: #1039

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(postcodes/IS): bulk-import 195 codes via iceaddr (#1039)

Adds Icelandic postcodes via the sveinbjornt/iceaddr Python package
which embeds the canonical postcode metadata under MIT licence.

1. bin/scripts/sync/import_iceland_postcodes.py — pipeline that
   dynamically imports the iceaddr POSTCODES dict and resolves each
   code's region via prefix range to states.json iso2 1-8 (Statistics
   Iceland's NUTS-3 boundaries: 1xx-2xx Capital, 3xx Western, 4xx
   Westfjords, 5xx Northwestern, 6xx Northeastern, 7xx Eastern,
   8xx-9xx Southern).

2. contributions/postcodes/IS.json — 195 records with 100% state_id
   resolution. Locality names combine stadur_nf + lysing
   (e.g. "Reykjavík, Miðborg").
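
The prefix-range lookup described in step 1 can be sketched as follows. The ranges come from the list above; the function name is hypothetical, and the sketch returns region names rather than iso2 codes because the name-to-code mapping is not spelled out in this message.

```python
# Leading digit of the postcode -> region, per the ranges listed above.
PREFIX_TO_REGION = {
    "1": "Capital",
    "2": "Capital",
    "3": "Western",
    "4": "Westfjords",
    "5": "Northwestern",
    "6": "Northeastern",
    "7": "Eastern",
    "8": "Southern",
    "9": "Southern",
}


def iceland_region_for_postcode(code):
    """Return the region name for a 3-digit postcode, or None if it is malformed."""
    text = str(code).strip()
    if len(text) != 3 or not text.isdigit():
        return None
    return PREFIX_TO_REGION.get(text[0])


region_101 = iceland_region_for_postcode("101")
region_600 = iceland_region_for_postcode(600)
region_bad = iceland_region_for_postcode("12")
```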

License & attribution
- Source: iceaddr (MIT) embedding Pósturinn data
- Each row: source: "iceaddr"

Refs: #1039

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(postcodes/SK+RO+SI): batch-import 15,585 codes via 3 community mirrors (#1039)

Bundles three small-to-medium European countries with confirmed
redistributable postcode mirrors into a single batch importer.

1. bin/scripts/sync/import_eu_batch1_postcodes.py — pipeline that
   ingests three different shapes (SK JSON, RO CSV, SI CSV) and writes
   per-country JSON files. ASCII-folding + dash-to-space normalisation
   handles the Romanian Caraș-Severin / Bistrița-Năsăud cases where
   the CSV uses spaces and states.json uses hyphens.

2. contributions/postcodes/SK.json — 1,312 records (100% state via
   KRAJ -> states.iso2 direct match)
3. contributions/postcodes/RO.json — 13,751 records (100% state via
   ASCII-folded judet name match; all 6 Bucharest sectors mapped to 'B')
4. contributions/postcodes/SI.json — 522 records, country-only by
   design (source has no municipality info; SI postcodes don't map
   cleanly to administrative regions)
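
The ASCII-folding and dash-to-space normalisation from step 1 might be implemented along these lines. This is a sketch using the standard unicodedata module, not the code in import_eu_batch1_postcodes.py; the function name and the extra case-folding step are assumptions.

```python
import unicodedata


def normalise_name(name):
    """ASCII-fold, turn hyphens into spaces, collapse whitespace, and casefold."""
    decomposed = unicodedata.normalize("NFKD", name)
    # Drop combining marks left behind by the decomposition (ș -> s, ă -> a).
    folded = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    folded = folded.replace("-", " ")
    return " ".join(folded.split()).casefold()


key_states = normalise_name("Caraș-Severin")
key_csv = normalise_name("Caras Severin")
key_bistrita = normalise_name("Bistrița-Năsăud")
```

Comparing normalised keys on both sides lets a CSV value written with spaces match a states.json name written with hyphens and diacritics.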

Validation (zero errors)
- All codes match countries.postal_code_regex
- All FKs resolve, all state_codes agree with state.iso2

License & attribution
- SK source: github.com/FeroVolar/PSC-JSON (community Slovenská pošta data)
- RO source: github.com/alexionegit/coduripostaleRomaniaPS
- SI source: github.com/dlabs/postcode_si (community Posta Slovenije data)

Refs: #1039

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(changelog): add notable callout for IT city→province remap (#1395/#1397/#1399)

The remap is a behavior change for downstream consumers — region-level
state_code queries (e.g. Sicily=82, Lombardy=25) now return empty arrays
because cities live under provinces/metropolitan cities, not regions.
Documents the traversal pattern (states.parent_id) needed for
region-aggregate queries so users know how to migrate.

* docs: multi-level territories policy (FR overseas, dual representation) (#1352 PR-C)

Adds MULTI_LEVEL_TERRITORIES.md documenting why 12 French overseas
territories (and analogous US/CN/NO entities) appear simultaneously as
ISO 3166-1 countries and as ISO 3166-2 subdivisions of their parent state.

Captures the maintainer's Option C decision on #1352: keep both
representations because (1) downstream API/SDK consumers filter on
country_code, (2) ISO 3166-1 lists them as countries, and (3) the
breaking change is unjustified for a labelling concern.

Cross-links the new policy doc from .claude/CLAUDE.md (Important Rules)
and README.md (contributing section).

No data changes.

Refs: #1352

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>