Allow importing areas for external postcode data#4092
Conversation
|
I have conserved the support for the existing csv import workflow alongside the new jsonl geometry support because for most countries (including my own🫠️), the centroid-tagged postcode dataset is rather dominant in open space than a maturely marked boundary dataset. |
|
I've skimmed through the PR and realise that we need to discuss a bit more how exactly the areas play with the other postcode sources. Right now the order is as follows: first postcodes areas from OSM are imported. They are considered authoritative. That means postcode points from OSM with the same postcode are ignored and so are any postcode point inside such an area. Then the remaining postcode points are used to compute centroids. The area for such a centroid is a rectangle according to extent. Finally, postcodes are added from the external source but only if they don't appear at all in the OSM data. If we now allows postcode areas in the external data, then it would make sense to assume that their area is authoritative as well. So the order would be something along those lines:
The other thing I noticed is that in your reading of GeoJSON, the geometry can be optional. That doesn't quite match with its specification. A geometry must always be present. So the rules for interpretation should go somewhere along this:
|
|
Thanks for the context. It will overhaul this PR completely. Now, please correct me if I’m deviating here: Instead of allowing If a user has a repository of postcode centroids, they can always import it through the existing CSV import-flow. the CSV files could continue to be handled by |
Hmm, yes, I see how this would make things less confusing and the code a lot easier. I'd then give the file a different name though, not just a different extension. Question still remains how they interact with the areas from OSM itself. Possible solutions that come to mind:
The third point is something that will be requested as a feature sooner or later. So it remains to decide, if there is a use-case for 1) or 2). @mtmail, any opinion? |
|
I'd say lets go with option 1) then for now. We can always add a command-line switch for 3) later. |
975dca9 to
3c65be6
Compare
| todo_countries: set[str] = set() | ||
|
|
||
| with conn.cursor() as cur: | ||
| cur.execute("""SELECT DISTINCT country_code FROM placex |
There was a problem hiding this comment.
Getting them from placex is really expensive. Use the country_name table instead which has the full list of countries that Nominatim can handle.
| " Ignored.", self.country) | ||
| return | ||
| postcode = analyzer.normalize_postcode(row['postcode']) | ||
| try: |
There was a problem hiding this comment.
Any particular reason why you moved this block out of the if of line 261?
| else: | ||
| return open(fname, 'r', encoding='utf-8') | ||
|
|
||
| try: |
There was a problem hiding this comment.
You seem to have removed the try/finally block. This is necessary when working with file descriptors without a with context. There might be other reasons something throws in the code except the parts that are explicitly caught right now.
Also needed above in _update_from_external_geometry().
| _PostcodeCollector(country, matcher.get_matcher(country), | ||
| matcher.get_postcode_extent(country), | ||
| exclude=area_pcs[country]).commit(conn, analyzer, project_dir, | ||
| is_initial, True) |
There was a problem hiding this comment.
I don't think it is worth reusing the PostcodeCollector class here. It makes the code of the class a lot more complicated and you don't need the core functionality of this class at all (collecting lots of points into a single centroid). Can you give it a try and see if something more similar to _update_postcode_areas() possibly with slightly changing _insert_postcode_areas() to make it work for that use case would simplify the code?
| with conn.cursor() as cur: | ||
| cur.execute("""SELECT country_code, postcode | ||
| FROM location_postcodes WHERE osm_id is not null | ||
| OR is_authoritative |
There was a problem hiding this comment.
In #4067 I went for looking if the geometry is a rectangle to determine if it is made from an area of not. I see how this is a bit hacky, so I don't mind adding an additional column instead. However, I would rather go for a more explicit is_area column which is also set to true for the postcode areas from OSM. Area or not is the information that we are after here, after all. Two things to keep in mind with the new column: A new column needs a migration (see https://github.com/osm-search/Nominatim/blob/master/src/nominatim_db/tools/migration.py) and you should adapt the code from #4067 to make proper use of the new column.
Summary
Closes #4080
This PR introduces support for importing external postcode data in both csv and jsonl (json lines) formats, allowing for richer postcode geometry import. In JSONL files, we can now provide full geometries, and centroids (optional).
The new JSONL files would need to placed in the project directory with name
XX_postcodes_geometry.jsonl.Currently OSM imported postcodes precede over jsonl imports which precedes csv imports.
The import logic, storage, and documentation are updated accordingly, and comprehensive tests are added for the new functionality.
AI Usage
LLM based CLI tools were extensively used to understand the codebase and find out the relevant files pertaining the targeted changes.
Contributor guidelines (mandatory)