fix(IT): drop 6 duplicate pairs flagged by remap (#1349 follow-up)#1399

Closed
dr5hn wants to merge 1 commit into fix/issue-1349-italy-native-field from fix/issue-1349-italy-dedup-flagged-pairs
Owner

@dr5hn dr5hn commented Apr 25, 2026

Stacks on #1395 and #1397. Targets fix/issue-1349-italy-native-field so the diff stays focused; rebase to master once the upstream PRs land.

Refs #1349 — implements the maintainer cleanup from the duplicate list in the city remap.

Drops (6 records)

| Drop id | Drop name | Kept id | Kept name | Reason |
|---|---|---|---|---|
| 58976 | Pozzaglio | 58977 | Pozzaglio ed Uniti | ISTAT canonical merged comune (Q42226) |
| 61329 | Torino | 61575 | Turin | English-name = repo convention; Turin row has population data, Torino row is null |
| 61530 | Trinità d'Agultu | 61531 | Trinità d'Agultu e Vignola | ISTAT canonical merged comune (Q341096) |
| 139215 | Inverno | 139216 | Inverno e Monteleone | ISTAT canonical merged comune (Q39917) |
| 139523 | Limite | 136799 | Capraia e Limite | ISTAT canonical merged comune (Q82639) |
| 140714 | Napoli | 140713 | Naples | English-name = repo convention; Naples row has correct main-city coords + population |

City count: 9,947 → 9,941.

NOT dropped — flagged for maintainer

Two pairs require manual handling because neither half carries the modern ISTAT-canonical name:

| Province | id | name | population | wikiDataId |
|---|---|---|---|---|
| MN | 60744 | Sermide | 3599 | (none) |
| MN | 138474 | Felonica | 1025 | Q42389 |

Canonical since 2017: Sermide e Felonica. Recommended: rename id 60744 to "Sermide e Felonica", set its native accordingly, drop id 138474.

| Province | id | name | population | wikiDataId |
|---|---|---|---|---|
| PV | 138065 | Corteolona | 2023 | Q40346 |
| PV | 138905 | Genzone | 358 | Q39697 |

Canonical since 2018: Corteolona e Genzone. Recommended: rename id 138065 to "Corteolona e Genzone", set native, drop id 138905.

These weren't auto-handled because renaming is irreversible and shouldn't be done without explicit signoff on the chosen kept-id.

Preconditions

The script italy_dedup_flagged_pairs.py verifies that every drop-target's current name matches the expected name and every keep-target exists with the expected name before mutating IT.json. If the IT data has shifted underneath, the script aborts with exit code 2 — no silent drops.
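The guard described above can be sketched roughly as follows. This is not the contents of italy_dedup_flagged_pairs.py, which is not shown in this conversation; the names `PAIRS`, `plan_drops`, and `run` are hypothetical, and only two of the six pairs are listed for brevity. The sketch assumes each city record is a dict with at least `id` and `name`.

```python
# Hypothetical sketch of a precondition-guarded dedup; not the actual script.
# Each tuple: (drop_id, expected_drop_name, keep_id, expected_keep_name).
PAIRS = [
    (58976, "Pozzaglio", 58977, "Pozzaglio ed Uniti"),
    (61329, "Torino", 61575, "Turin"),
]


class PreconditionError(Exception):
    """Raised when the data no longer matches what the fix expects."""


def plan_drops(cities, pairs):
    """Return the set of ids that are safe to drop, or raise before any mutation."""
    by_id = {city["id"]: city for city in cities}
    to_drop = set()
    for drop_id, drop_name, keep_id, keep_name in pairs:
        keep = by_id.get(keep_id)
        if keep is None or keep["name"] != keep_name:
            raise PreconditionError(f"keep-target {keep_id} is not {keep_name!r}")
        drop = by_id.get(drop_id)
        if drop is None:
            continue  # already gone: skipping makes a re-run a no-op
        if drop["name"] != drop_name:
            raise PreconditionError(f"drop-target {drop_id} is not {drop_name!r}")
        to_drop.add(drop_id)
    return to_drop


def run(cities, pairs):
    """Return (exit_code, resulting_cities); exit code 2 means nothing was changed."""
    try:
        to_drop = plan_drops(cities, pairs)
    except PreconditionError:
        return 2, cities
    return 0, [city for city in cities if city["id"] not in to_drop]
```

Validating every pair before removing anything means a single mismatch leaves the file untouched, and treating a missing drop-target as "already gone" is what lets a second run report zero drops.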

Test plan

  • python3 bin/scripts/fixes/italy_dedup_flagged_pairs.py --dry-run reports Dropped this run: 0 after this PR (idempotent).
  • jq 'length' contributions/cities/IT.json → 9941.
  • jq '[.[] | select(.id == 58976 or .id == 61329 or .id == 61530 or .id == 139215 or .id == 139523 or .id == 140714)] | length' contributions/cities/IT.json → 0.
  • CI's validate-schema, validate-cross-reference, validate-coordinates, detect-duplicates pass.

🤖 Generated with Claude Code

Dropped the legacy half of each pair, keeping the ISTAT-canonical (or
English-name, per repo convention) record:

  id 58976  'Pozzaglio'              -> kept 58977  'Pozzaglio ed Uniti'
  id 61329  'Torino'                 -> kept 61575  'Turin'
  id 61530  'Trinità d\'Agultu'      -> kept 61531  'Trinità d\'Agultu e Vignola'
  id 139215 'Inverno'                -> kept 139216 'Inverno e Monteleone'
  id 139523 'Limite'                 -> kept 136799 'Capraia e Limite'
  id 140714 'Napoli'                 -> kept 140713 'Naples'

Two pairs are intentionally NOT touched and require maintainer review,
since neither record carries the ISTAT-canonical merged name:
  - MN: 'Sermide' (id 60744) + 'Felonica' (id 138474)
        canonical comune is 'Sermide e Felonica' (since 2017).
  - PV: 'Corteolona' (id 138065) + 'Genzone' (id 138905)
        canonical comune is 'Corteolona e Genzone' (since 2018).

Stacks on top of #1395 and #1397.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Owner Author

dr5hn commented Apr 27, 2026

Weekly data-quality review (2026-04-27)

Verdict: clean

Checks

  • Schema: ✅ Deletions only; no new records. The 6 dropped records were existing entries with valid schema.
  • FK integrity: ✅ No other city references the dropped IDs (58976, 61329, 61530, 139215, 139523, 140714) — safe to delete. Kept records already existed and are unaffected.
  • Coordinates: N/A (no coordinate changes)
  • Wikidata: N/A (no Wikidata field changes; note that ids 61329 "Torino" and 140714 "Napoli" shared the same wikiDataId as their kept counterparts Q495/Q2634 — correct reason to drop the duplicates)
  • Naming convention: ✅ Repo English-name convention correctly applied: "Turin" kept over "Torino"; "Naples" kept over "Napoli". ISTAT canonical merged-comune names kept ("Pozzaglio ed Uniti", "Capraia e Limite", "Trinità d'Agultu e Vignola", "Inverno e Monteleone") over legacy partial-name rows.

Advisory (non-blocking)

  • 2 ambiguous pairs left open — Sermide/Felonica (MN) and Corteolona/Genzone (PV): neither row in each pair carries the modern ISTAT canonical merged name ("Sermide e Felonica", "Corteolona e Genzone"). Maintainer sign-off needed on which ID to rename before the other can be dropped.
  • The precondition guards in italy_dedup_flagged_pairs.py (abort if expected names have shifted) are good practice — keep them in place.

🤖 Automated weekly review — Claude (sonnet-4-6).


Generated by Claude Code

dr5hn added a commit that referenced this pull request Apr 27, 2026
…#1397/#1399)

The remap is a behavior change for downstream consumers — region-level
state_code queries (e.g. Sicily=82, Lombardy=25) now return empty arrays
because cities live under provinces/metropolitan cities, not regions.
Documents the traversal pattern (states.parent_id) needed for
region-aggregate queries so users know how to migrate.
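A minimal sketch of that traversal, assuming state rows carry `id`, `iso2`, and `parent_id`, and city rows carry `state_id` and `state_code`. The sample rows and ids below are illustrative only and are not taken from the dataset; the function name `cities_in_region` is hypothetical.

```python
# Illustrative rows only: ids are made up, the field layout is an assumption.
states = [
    {"id": 1, "name": "Sicily", "iso2": "82", "parent_id": None},
    {"id": 2, "name": "Palermo", "iso2": "PA", "parent_id": 1},
    {"id": 3, "name": "Lombardy", "iso2": "25", "parent_id": None},
    {"id": 4, "name": "Milan", "iso2": "MI", "parent_id": 3},
]
cities = [
    {"name": "Palermo", "state_id": 2, "state_code": "PA"},
    {"name": "Milan", "state_id": 4, "state_code": "MI"},
]


def cities_in_region(region_code, states, cities):
    """Collect cities whose province (or metropolitan city) has the region as parent."""
    region_ids = {s["id"] for s in states if s["iso2"] == region_code}
    child_ids = {s["id"] for s in states if s["parent_id"] in region_ids}
    wanted = region_ids | child_ids
    return [c for c in cities if c["state_id"] in wanted]


# A direct region-level filter finds nothing after the remap ...
direct = [c for c in cities if c["state_code"] == "82"]
# ... while the parent_id traversal recovers the region's cities.
via_parent = cities_in_region("82", states, cities)
```

In a real query the state lookup would also need to be scoped to the country, since subdivision codes are only unique within one country.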
dr5hn marked this pull request as ready for review April 27, 2026 14:46
Copilot AI review requested due to automatic review settings April 27, 2026 14:46
dosubot (bot) added the size:XS label Apr 27, 2026
Copilot AI left a comment

Pull request overview

Removes six known duplicate Italy city records (flagged during the #1349 remap follow-up) to keep contributions/cities/IT.json consistent and reduce duplicate comune entries.

Changes:

  • Dropped 6 duplicate city records from contributions/cities/IT.json (keeping the canonical/English-name counterparts).
  • Added a dedicated fix script (italy_dedup_flagged_pairs.py) that verifies preconditions and performs the drops (idempotent via “already gone” skips).

Reviewed changes

Copilot reviewed 1 out of 2 changed files in this pull request and generated no comments.

| File | Description |
|---|---|
| contributions/cities/IT.json | Deletes the 6 duplicate city records listed in the PR description. |
| bin/scripts/fixes/italy_dedup_flagged_pairs.py | Adds an idempotent fix script with safety checks to perform the same targeted deletions. |

Owner Author

dr5hn commented Apr 27, 2026

Superseded by #1479 (rebased onto current master after #1395 + #1397 squash-merged).

@dr5hn dr5hn closed this Apr 27, 2026
dr5hn added a commit that referenced this pull request Apr 27, 2026
…n) (#1352 PR-C) (#1392)

* feat(postcodes/DK): bulk-import 1,089 codes via DAWA (#1039)

Adds Danish postcodes via DAWA (Danmarks Adressers Web API) — public
sector data published under CC-0 by SDFI/Dataforsyningen.

1. bin/scripts/sync/import_denmark_postcodes.py — pipeline that fetches
   /kommuner to build a kommune-code -> region-name map, then resolves
   each /postnumre record's region via its first kommune. Maps the 5
   Danish region names to states.json iso2 codes:
     Region Hovedstaden -> 84 (called "Denmark" in states.json)
     Region Sjælland    -> 85 (Zealand)
     Region Syddanmark  -> 83 (Southern Denmark)
     Region Midtjylland -> 82 (Central Denmark)
     Region Nordjylland -> 81 (North Denmark)

2. contributions/postcodes/DK.json — 1,089 codes covering all 5 regions
   with 100% state_id + 100% coordinate resolution.

Validation (zero errors)
- All codes match countries.postal_code_regex (^(\\d{4})\$)
- All FKs resolve, all state_codes agree with state.iso2

License & attribution
- Source: SDFI / Dataforsyningen DAWA (CC-0)
- Each row: source: "dawa"

Refs: #1039

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(postcodes/IS): bulk-import 195 codes via iceaddr (#1039)

Adds Icelandic postcodes via the sveinbjornt/iceaddr Python package
which embeds the canonical postcode metadata under MIT licence.

1. bin/scripts/sync/import_iceland_postcodes.py — pipeline that
   dynamically imports the iceaddr POSTCODES dict and resolves each
   code's region via prefix range to states.json iso2 1-8 (Statistics
   Iceland's NUTS-3 boundaries: 1xx-2xx Capital, 3xx Western, 4xx
   Westfjords, 5xx Northwestern, 6xx Northeastern, 7xx Eastern,
   8xx-9xx Southern).

2. contributions/postcodes/IS.json — 195 records with 100% state_id
   resolution. Locality names combine stadur_nf + lysing
   (e.g. "Reykjavík, Miðborg").
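The prefix-range resolution described in item 1 might look roughly like this. It is a sketch, not the code in import_iceland_postcodes.py: the function name is hypothetical, the region labels are the ones used in the commit message, and the final step from a region label to a states.json iso2 code is left out because that assignment is not spelled out here.

```python
import re

# First digit of a three-digit Icelandic postcode -> region label,
# following the ranges listed in the commit message above.
REGION_BY_FIRST_DIGIT = {
    "1": "Capital",
    "2": "Capital",
    "3": "Western",
    "4": "Westfjords",
    "5": "Northwestern",
    "6": "Northeastern",
    "7": "Eastern",
    "8": "Southern",
    "9": "Southern",
}


def region_for_postcode(code):
    """Return the region label for a postcode, or None if it cannot be resolved."""
    text = str(code).strip()
    if not re.fullmatch(r"\d{3}", text):
        return None  # not a three-digit code
    return REGION_BY_FIRST_DIGIT.get(text[0])
```

Returning None for anything outside the known ranges keeps unresolved codes visible to the caller instead of silently assigning them to a region.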

License & attribution
- Source: iceaddr (MIT) embedding Pósturinn data
- Each row: source: "iceaddr"

Refs: #1039

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(postcodes/SK+RO+SI): batch-import 15,585 codes via 3 community mirrors (#1039)

Bundles three small-to-medium European countries with confirmed
redistributable postcode mirrors into a single batch importer.

1. bin/scripts/sync/import_eu_batch1_postcodes.py — pipeline that
   ingests three different shapes (SK JSON, RO CSV, SI CSV) and writes
   per-country JSON files. ASCII-folding + dash-to-space normalisation
   handles the Romanian Caraș-Severin / Bistrița-Năsăud cases where
   the CSV uses spaces and states.json uses hyphens.

2. contributions/postcodes/SK.json — 1,312 records (100% state via
   KRAJ -> states.iso2 direct match)
3. contributions/postcodes/RO.json — 13,751 records (100% state via
   ASCII-folded judet name match; all 6 Bucharest sectors mapped to 'B')
4. contributions/postcodes/SI.json — 522 records, country-only by
   design (source has no municipality info; SI postcodes don't map
   cleanly to administrative regions)
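The ASCII-folding and dash-to-space normalisation mentioned in item 1 can be sketched with the standard library alone. This is one possible approach rather than the code in import_eu_batch1_postcodes.py; `normalise_name` and `build_index` are hypothetical names, and the two state rows are illustrative.

```python
import unicodedata


def normalise_name(name):
    """Fold diacritics to ASCII, turn hyphens into spaces, lowercase, collapse spaces."""
    decomposed = unicodedata.normalize("NFKD", name)
    folded = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(folded.replace("-", " ").lower().split())


def build_index(states):
    """Map each normalised state name to its iso2 code."""
    return {normalise_name(s["name"]): s["iso2"] for s in states}


# Illustrative rows: states.json uses hyphens, the CSV source uses spaces.
states = [
    {"name": "Caraș-Severin", "iso2": "CS"},
    {"name": "Bistrița-Năsăud", "iso2": "BN"},
]
index = build_index(states)
code = index.get(normalise_name("Caras Severin"))
```

Normalising both sides with the same function is what makes the match symmetric: it does not matter which source carries the diacritics or the hyphen.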

Validation (zero errors)
- All codes match countries.postal_code_regex
- All FKs resolve, all state_codes agree with state.iso2

License & attribution
- SK source: github.com/FeroVolar/PSC-JSON (community Slovenská pošta data)
- RO source: github.com/alexionegit/coduripostaleRomaniaPS
- SI source: github.com/dlabs/postcode_si (community Posta Slovenije data)

Refs: #1039

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(changelog): add notable callout for IT city→province remap (#1395/#1397/#1399)

The remap is a behavior change for downstream consumers — region-level
state_code queries (e.g. Sicily=82, Lombardy=25) now return empty arrays
because cities live under provinces/metropolitan cities, not regions.
Documents the traversal pattern (states.parent_id) needed for
region-aggregate queries so users know how to migrate.

* docs: multi-level territories policy (FR overseas, dual representation) (#1352 PR-C)

Adds MULTI_LEVEL_TERRITORIES.md documenting why 12 French overseas
territories (and analogous US/CN/NO entities) appear simultaneously as
ISO 3166-1 countries and as ISO 3166-2 subdivisions of their parent state.

Captures the maintainer's Option C decision on #1352: keep both
representations because (1) downstream API/SDK consumers filter on
country_code, (2) ISO 3166-1 lists them as countries, and (3) the
breaking change is unjustified for a labelling concern.

Cross-links the new policy doc from .claude/CLAUDE.md (Important Rules)
and README.md (contributing section).

No data changes.

Refs: #1352

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Labels

fixed Issue has been fixed size:XS This PR changes 0-9 lines, ignoring generated files.