Skip to content

Allow importing areas for external postcode data#4092

Open
AyushDharDubey wants to merge 3 commits into
osm-search:masterfrom
AyushDharDubey:issue_4080
Open

Allow importing areas for external postcode data#4092
AyushDharDubey wants to merge 3 commits into
osm-search:masterfrom
AyushDharDubey:issue_4080

Conversation

@AyushDharDubey
Copy link
Copy Markdown
Contributor

@AyushDharDubey AyushDharDubey commented May 10, 2026

Summary

Closes #4080
This PR introduces support for importing external postcode data in both csv and jsonl (json lines) formats, allowing for richer postcode geometry import. In JSONL files, we can now provide full geometries, and centroids (optional).

The new JSONL files would need to placed in the project directory with name XX_postcodes_geometry.jsonl.

Currently OSM imported postcodes precede over jsonl imports which precedes csv imports.

The import logic, storage, and documentation are updated accordingly, and comprehensive tests are added for the new functionality.

AI Usage

LLM based CLI tools were extensively used to understand the codebase and find out the relevant files pertaining the targeted changes.

Contributor guidelines (mandatory)

  • I have adhered to the coding style
  • I have tested the proposed changes
  • I have disclosed above any use of AI to generate code, documentation, or the pull request description

@AyushDharDubey AyushDharDubey changed the title Enhance: Allow importing areas for external postcode data Allow importing areas for external postcode data May 10, 2026
@AyushDharDubey
Copy link
Copy Markdown
Contributor Author

I have conserved the support for the existing csv import workflow alongside the new jsonl geometry support because for most countries (including my own🫠️), the centroid-tagged postcode dataset is rather dominant in open space than a maturely marked boundary dataset.

@AyushDharDubey AyushDharDubey marked this pull request as ready for review May 10, 2026 23:02
@lonvia
Copy link
Copy Markdown
Member

lonvia commented May 11, 2026

I've skimmed through the PR and realise that we need to discuss a bit more how exactly the areas play with the other postcode sources.

Right now the order is as follows: first postcodes areas from OSM are imported. They are considered authoritative. That means postcode points from OSM with the same postcode are ignored and so are any postcode point inside such an area. Then the remaining postcode points are used to compute centroids. The area for such a centroid is a rectangle according to extent. Finally, postcodes are added from the external source but only if they don't appear at all in the OSM data.

If we now allows postcode areas in the external data, then it would make sense to assume that their area is authoritative as well. So the order would be something along those lines:

  1. Import postcode areas from OSM.
  2. Add postcode areas from external data for postcodes not found as areas in OSM.
  3. Add postcode centroids from OSM and external data as before but also using external postcode areas as exclusion zones/postcodes.

The other thing I noticed is that in your reading of GeoJSON, the geometry can be optional. That doesn't quite match with its specification. A geometry must always be present. So the rules for interpretation should go somewhere along this:

  • if geometry is POINT, interpret as postcode centroid. Also parse extent if present.
  • If geometry is POLYGON or MULTIPOLYGON, use as exact postcode area. Also parse lat/lon if present and use as centroid. Ignore extent.
  • If geometry is any other geometry type, ignore as invalid.

@AyushDharDubey
Copy link
Copy Markdown
Contributor Author

Thanks for the context. It will overhaul this PR completely.

Now, please correct me if I’m deviating here:

Instead of allowing POINT geometry in JSONL, we could rather restrict JSONL files to POLYGON and MULTIPOLYGON geometries only, and treat this file as authoritative, on par with OSM postcode area data (only if the postcode area is not already present from OSM).

If a user has a repository of postcode centroids, they can always import it through the existing CSV import-flow.

the CSV files could continue to be handled by _update_guessed_postcode() & JSONL files would be handled by a new method executed after _update_postcode_areas() (OSM postcode import). ?

@lonvia
Copy link
Copy Markdown
Member

lonvia commented May 13, 2026

Instead of allowing POINT geometry in JSONL, we could rather restrict JSONL files to POLYGON and MULTIPOLYGON geometries only, and treat this file as authoritative, on par with OSM postcode area data (only if the postcode area is not already present from OSM).

Hmm, yes, I see how this would make things less confusing and the code a lot easier. I'd then give the file a different name though, not just a different extension. XX_postcode_areas.jsonl, for example.

Question still remains how they interact with the areas from OSM itself. Possible solutions that come to mind:

  1. Only import when no area with the same postcode exists.
  2. Only import the parts where they do not overlap with postcode areas from OSM. (Potentially making the area smaller to avoid overlaps.)
  3. Make the external postcodes completely authoritative and ignore all postcodes from OSM. (i.e. skipping over the entire postcode generation process for the given country and just import the external ones)

The third point is something that will be requested as a feature sooner or later. So it remains to decide, if there is a use-case for 1) or 2). @mtmail, any opinion?

@lonvia
Copy link
Copy Markdown
Member

lonvia commented May 20, 2026

I'd say lets go with option 1) then for now. We can always add a command-line switch for 3) later.

todo_countries: set[str] = set()

with conn.cursor() as cur:
cur.execute("""SELECT DISTINCT country_code FROM placex
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Getting them from placex is really expensive. Use the country_name table instead which has the full list of countries that Nominatim can handle.

" Ignored.", self.country)
return
postcode = analyzer.normalize_postcode(row['postcode'])
try:
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any particular reason why you moved this block out of the if of line 261?

else:
return open(fname, 'r', encoding='utf-8')

try:
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You seem to have removed the try/finally block. This is necessary when working with file descriptors without a with context. There might be other reasons something throws in the code except the parts that are explicitly caught right now.

Also needed above in _update_from_external_geometry().

_PostcodeCollector(country, matcher.get_matcher(country),
matcher.get_postcode_extent(country),
exclude=area_pcs[country]).commit(conn, analyzer, project_dir,
is_initial, True)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it is worth reusing the PostcodeCollector class here. It makes the code of the class a lot more complicated and you don't need the core functionality of this class at all (collecting lots of points into a single centroid). Can you give it a try and see if something more similar to _update_postcode_areas() possibly with slightly changing _insert_postcode_areas() to make it work for that use case would simplify the code?

with conn.cursor() as cur:
cur.execute("""SELECT country_code, postcode
FROM location_postcodes WHERE osm_id is not null
OR is_authoritative
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In #4067 I went for looking if the geometry is a rectangle to determine if it is made from an area of not. I see how this is a bit hacky, so I don't mind adding an additional column instead. However, I would rather go for a more explicit is_area column which is also set to true for the postcode areas from OSM. Area or not is the information that we are after here, after all. Two things to keep in mind with the new column: A new column needs a migration (see https://github.com/osm-search/Nominatim/blob/master/src/nominatim_db/tools/migration.py) and you should adapt the code from #4067 to make proper use of the new column.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Allow importing areas for external postcodes

2 participants