Performance Improvements by a-regal · Pull Request #59 · EL-BID/urbanpy

a-regal · 2026-05-23T14:20:46Z

Performance improvements with minimal changes across modules

Before moving further with publishing to conda and other general improvements to the package, I think a quality-of-life update to our package's performance is important. I went through our modules one by one with Claude Code to figure out where we can gain the most with the fewest changes. In total, there are five per-module commits on top of master, with the following breakdown:

54d6d90 fix/perf(download): vectorize point construction, fix hdx_dataset typos
b30960b fix/perf(utils): no-op fillna fallback in overpass_to_gdf; one-pass relation polys
3f93ce2 perf(accessibility): vectorize friction, cache centroids in travel_times
0fd8f2c perf(routing): batch OSRM via /table/, pooled Session, single-source Dijkstra in isochrones
d6aa459 perf(geom): deduplicate hex IDs before building polygons; one-pass stats writes; union_all

Benchmarks (synthetic data, best of 5)

| Path | Old | New | Speedup |
|---|---|---|---|
| `accessibility.friction` (200k rows) | 2103.6 ms | 8.9 ms | 237× |
| `geom.osmnx_coefficient_computation` write pattern (2k × 4) | 917.6 ms | 4.5 ms | 202× |
| `routing.compute_osrm_dist_matrix` (10 × 10, local stub) | 117.3 ms | 1.6 ms | 74× ¹ |
| `download.get_hdx_dataset` Point construction (400k rows) | 5231.2 ms | 96.7 ms | 54× |
| `accessibility.travel_times` centroid handling (20k units) | 577.0 ms | 18.3 ms | 31× |
| `routing.isochrone_from_graph` (2k-node graph, 3 × 3) | 572.3 ms | 145.2 ms | 3.9× |
| `utils.process_overpass_relations` (5k ways) | 143.8 ms | 111.4 ms | 1.3× |
| `geom.gen_hexagons` (res-8 over 0.5°²) | 103.7 ms | 98.6 ms | 1.05× ² |
| `geom.merge_geom_downloads` (6 overlapping gdfs) | 2.22 ms | 2.18 ms | ≈ ³ |

¹ Local in-process HTTP stub — with a real OSRM server the gap is dominated by N·M HTTP round-trips, so this is a lower bound.
² Bottleneck is h3.geo_to_cells / cell_to_boundary themselves; new code's main win is avoiding duplicate Polygon construction on multi-part inputs.
³ union_all is the same speed as unary_union — change exists to clear the shapely 2 deprecation warning.

Correctness fixes shipped alongside

utils.overpass_to_gdf: the per-key NaN fallback loop discarded the result of Series.fillna() (no assignment, no inplace=True), so secondary tag keys never populated poi_type. Now correctly fills from fallback keys.
download.hdx_dataset: had resource.ends_with(...) / resource.starts_with(...) (AttributeError on the first call) and an unbound hdx_url = hdx_url branch. Both fixed.
np.NaN → np.nan and k in tag.keys() → tag.get(k, np.nan) across utils.py and download.py.

Per-module change detail

`geom` (commit `d6aa459`)

gen_hexagons: deduplicate H3 cell IDs across (multi)polygon parts before materializing shapely Polygons, dropping the post-hoc drop_duplicates(). Equal output for single-polygon cities; avoids redundant Polygon construction on multi-part inputs.
osmnx_coefficient_computation: accumulate per-row stats into a list of dicts and assign once. Replaces O(N·K) .loc[i, col] = ... writes that forced repeated reindex / dtype upcasts.
merge_geom_downloads (+ plotting.choropleth_map): switch deprecated GeoSeries.unary_union to GeoSeries.union_all() (shapely 2 path).
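
The write-pattern change in osmnx_coefficient_computation can be illustrated with a minimal sketch. The `compute_stats` helper and the `stat_*` column names are hypothetical stand-ins, not urbanpy's actual code; only the contrast between per-cell `.loc` writes and a single assignment is the point.

```python
import pandas as pd


def compute_stats(i):
    """Hypothetical stand-in for the per-row statistics computation."""
    return {"stat_a": i * 1.0, "stat_b": i * 2.0, "stat_c": i * 3.0, "stat_d": i * 4.0}


def fill_cell_by_cell(df):
    # Old pattern: one .loc write per (row, column) pair.
    out = df.copy()
    for i in out.index:
        for col, value in compute_stats(i).items():
            out.loc[i, col] = value
    return out


def fill_once(df):
    # New pattern: collect one dict per row, build a frame, attach it in one step.
    rows = [compute_stats(i) for i in df.index]
    stats = pd.DataFrame(rows, index=df.index)
    return df.join(stats)


df = pd.DataFrame({"id": range(5)})
slow = fill_cell_by_cell(df)
fast = fill_once(df)
```

Both functions produce the same values; the second touches the DataFrame once instead of N·K times.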

`routing` (commit `0fd8f2c`)

compute_osrm_dist_matrix: replace nested per-pair /route/ requests with a single OSRM /table/v1/ call; preserves the per-pair fallback for older OSRM builds.
Module-level requests.Session with urllib3 Retry + connection pooling, reused by osrm_route, ors_api, isochrone_from_api. Removes per-call TCP/TLS handshake overhead and the ad-hoc time.sleep retry.
isochrone_from_graph: run nx.single_source_dijkstra_path_length once per center and threshold the node set for each trip_time, instead of re-running ego_graph for every (center, time) pair. Also swap the GeoSeries(...).unary_union.convex_hull round-trip for a direct shapely.MultiPoint hull.
type(x) == gpd.GeoSeries → isinstance(x, gpd.GeoSeries).
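
The isochrone_from_graph idea (one shortest-path pass per center, then one threshold per trip_time) can be sketched without networkx. This is a standard-library illustration under stated assumptions: the adjacency-dict graph, `travel_times_from`, and `reachable_sets` are hypothetical names, and the hand-written Dijkstra stands in for nx.single_source_dijkstra_path_length.

```python
import heapq


def travel_times_from(graph, source):
    """Shortest travel time from source to every reachable node (Dijkstra)."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale queue entry, a shorter path was already found
        for neighbor, weight in graph.get(node, {}).items():
            nd = d + weight
            if nd < dist.get(neighbor, float("inf")):
                dist[neighbor] = nd
                heapq.heappush(heap, (nd, neighbor))
    return dist


def reachable_sets(graph, center, trip_times):
    # One traversal per center; every trip_time is a filter over the same result.
    dist = travel_times_from(graph, center)
    return {t: {n for n, d in dist.items() if d <= t} for t in trip_times}


# Toy graph: edge weights are travel times in minutes.
graph = {
    "a": {"b": 5, "c": 12},
    "b": {"a": 5, "c": 4, "d": 20},
    "c": {"a": 12, "b": 4},
    "d": {"b": 20},
}
sets = reachable_sets(graph, "a", [5, 10, 30])
```

With T trip times per center, this does one graph traversal instead of T, which is where the reported gain would come from.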

`accessibility` (commit `3f93ce2`)

friction: accept array input via np.where; preserves scalar behavior.
hu_access_map: replace the two progress_apply(scalar friction(...), axis=1) passes with numpy distance computation on cached x/y arrays.
travel_times: compute geometry.centroid once and reuse for both nn_search inputs and OSRM calls; pull nearest POI geometries up front instead of recomputing centroid + pois.iloc per row inside progress_apply.
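
As a rough illustration of the friction change, the sketch below vectorizes a piecewise function with np.where while still returning a plain float for scalar input. The formula itself (linear decay up to a hypothetical `max_distance`) is an assumption made for the example and is not urbanpy's friction definition.

```python
import numpy as np


def friction(distance, max_distance=1000.0):
    """Hypothetical piecewise friction: linear decay, zero beyond max_distance."""
    distance = np.asarray(distance, dtype=float)
    # np.where evaluates both branches for every element, then selects per element.
    result = np.where(distance < max_distance, 1.0 - distance / max_distance, 0.0)
    # Unwrap 0-d results so existing scalar callers still get a float.
    return float(result) if result.ndim == 0 else result


scalar_value = friction(250.0)
array_value = friction([0.0, 500.0, 2000.0])
```

A whole distance column can then be passed in one call instead of going through a row-wise apply.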

`utils` (commit `b30960b`)

overpass_to_gdf: assign the result of Series.fillna() back to gdf["poi_type"]. Switch tag[k] if k in tag.keys() else np.NaN → tag.get(k, np.nan).
process_overpass_relations: replace three sequential .apply() passes (shell → length filter → Polygon) with a single Python loop, and call shapely.make_valid on the underlying array instead of per-row apply.
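
The overpass_to_gdf bug is easy to reproduce in isolation. In this minimal sketch the `amenity` and `shop` columns are hypothetical example tag keys; the behavior shown (Series.fillna returns a new Series unless the result is assigned) is standard pandas.

```python
import numpy as np
import pandas as pd

gdf = pd.DataFrame(
    {
        "amenity": ["cafe", np.nan, np.nan],
        "shop": [np.nan, "bakery", np.nan],
    }
)

# Discarded result: fillna returns a new Series, so `broken` is unchanged.
broken = gdf["amenity"].copy()
broken.fillna(gdf["shop"])

# Assigned result: missing values are filled from the fallback key.
fixed = gdf["amenity"].fillna(gdf["shop"])

# dict.get with a default replaces `tag[k] if k in tag.keys() else np.nan`.
tag = {"amenity": "cafe"}
missing = tag.get("shop", np.nan)
```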

`download` (commit `54d6d90`)

get_hdx_dataset: df.apply(lambda r: Point(...), axis=1) → gpd.points_from_xy on the filtered slice; .copy() the slice; set crs="EPSG:4326" on the result.
overpass_pois: same tag.get(...) rewrite as utils.
hdx_dataset: .ends_with/.starts_with → .endswith/.startswith; fix unbound hdx_url = hdx_url branch.

Bench methodology

Each comparison runs the old and new implementations side-by-side in the same process against synthetic data sized to make per-call overhead visible (e.g. 200k / 400k / 2k rows). Reported numbers are best of 5. Network-dependent paths (OSRM /route, /table, Overpass) are exercised against a local in-process HTTP stub so we measure code-side overhead — the real-world gap on OSRM /table will be substantially larger because the per-pair path pays N·M HTTP round-trips.
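
For readers who want to reproduce this kind of comparison, a best-of-5 harness can be sketched with the standard library. This is not the script used for the numbers above; `best_of`, `old_impl`, and `new_impl` are hypothetical names, and the two implementations are toy placeholders.

```python
import timeit


def best_of(fn, repeat=5):
    """Return the fastest of `repeat` single-call timings, in milliseconds."""
    return min(timeit.repeat(fn, number=1, repeat=repeat)) * 1000.0


def old_impl(values):
    # Placeholder "old" path: explicit accumulation loop.
    total = 0
    for v in values:
        total += v * v
    return total


def new_impl(values):
    # Placeholder "new" path: same result through a generator expression.
    return sum(v * v for v in values)


data = list(range(100_000))
old_ms = best_of(lambda: old_impl(data))
new_ms = best_of(lambda: new_impl(data))
print(f"old {old_ms:.2f} ms, new {new_ms:.2f} ms, ratio {old_ms / new_ms:.2f}x")
```

Taking the minimum rather than the mean reduces the influence of unrelated system noise on short timings.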

…ites; union_all

- gen_hexagons: dedup H3 cell IDs across (multi)polygon parts before materializing shapely Polygons, dropping the post-hoc drop_duplicates. Equal output for single-polygon cities; ~1.05x and avoids redundant Polygon construction on multi-part inputs.
- osmnx_coefficient_computation: accumulate per-row stats into a list of dicts and assign once; replaces O(N*K) .loc[i, col] = ... that forced repeated reindex/upcasts. 917 ms -> 4.5 ms on 2000 rows x 4 stats (~200x).
- merge_geom_downloads (+ plotting.choropleth_map): switch deprecated GeoSeries.unary_union to GeoSeries.union_all() (shapely 2 path).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…Dijkstra in isochrones

- compute_osrm_dist_matrix: replace nested per-pair /route/ requests with a single OSRM /table/v1/ call; preserves the per-pair fallback for older OSRM builds. Local stub: 117 ms -> 1.6 ms for a 10x10 matrix (~74x). With a real OSRM server the gap widens further as the cost is dominated by N*M HTTP round-trips.
- Add a module-level requests.Session with urllib3 Retry + connection pooling, reused by osrm_route, ors_api, isochrone_from_api. Removes per-call TCP/TLS handshake overhead and the ad-hoc time.sleep retry.
- isochrone_from_graph: run nx.single_source_dijkstra_path_length once per center and threshold the node set for each trip_time, instead of re-running ego_graph for every (center, time) pair. Also swap the GeoSeries(...).unary_union.convex_hull round-trip for a direct shapely.MultiPoint hull. 572 ms -> 145 ms on a 2k-node graph (~3.9x).
- type(x) == gpd.GeoSeries -> isinstance(x, gpd.GeoSeries).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

- friction: accept array input via np.where; preserves scalar behavior.
- hu_access_map: replace the two progress_apply(scalar friction(...), axis=1) passes with numpy distance computation on cached x/y arrays. Synthetic 200k-row distance frame: 2104 ms -> 8.9 ms (~237x).
- travel_times: compute geometry.centroid once and reuse for both nn_search inputs and OSRM calls; pull nearest POI geometries up front instead of recomputing centroid + pois.iloc per row inside progress_apply. Centroid handling alone: 577 ms -> 18 ms on a 20k unit gdf (~31x); the remaining cost is the OSRM HTTP call itself (separate /table/ work in routing).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…elation polys

- overpass_to_gdf: the per-key NaN fallback loop discarded the result of Series.fillna() (no assignment, no inplace=True), so secondary tag keys never populated 'poi_type'. Assign the result back. Also switch tag[k] if k in tag.keys() else np.NaN -> tag.get(k, np.nan) (faster lookup, no-op .keys() removed, np.NaN deprecated alias).
- process_overpass_relations: replace three sequential .apply() passes (shell -> length filter -> Polygon) with a single Python loop, and call shapely.make_valid on the underlying array instead of per-row apply. 5k-way payload: 144 ms -> 111 ms (~1.3x); also avoids a transient 'shell' column and the deprecated GeoSeries-level apply.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

- get_hdx_dataset: replace df.apply(lambda r: Point(...), axis=1) with gpd.points_from_xy on the filtered slice; also .copy() the slice to avoid SettingWithCopyWarning, and set crs="EPSG:4326" on the resulting GeoDataFrame. 400k rows: 5231 ms -> 97 ms (~54x).
- overpass_pois: tag[k] if k in tag.keys() else np.NaN -> tag.get(k, np.nan) (faster, np.NaN is deprecated, redundant .keys() removed).
- hdx_dataset: fix str.ends_with/starts_with -> .endswith/.startswith (the originals would AttributeError on the first call) and the unbound 'hdx_url = hdx_url' branch.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Claudio9701

LGTM. One comment for a follow-up related to OSRM: the user should be able to set a custom port if needed. We currently have port 5000 hardcoded in many places.
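
One possible shape for that follow-up, offered only as a sketch: build the OSRM /table/v1/ URL from a caller-supplied base URL instead of a hardcoded port. The function name `osrm_table_url` and its parameters are hypothetical and not part of urbanpy; the URL layout follows the OSRM table service (coordinates as lon,lat pairs separated by semicolons, with `sources` and `destinations` given as indices into that list).

```python
def osrm_table_url(origins, destinations, base_url="http://localhost:5000", profile="driving"):
    """Build one /table/v1/ URL covering every origin-destination pair.

    origins and destinations are sequences of (lon, lat) tuples.
    """
    points = list(origins) + list(destinations)
    coords = ";".join(f"{lon},{lat}" for lon, lat in points)
    # Origins occupy the first indices, destinations the remaining ones.
    sources = ";".join(str(i) for i in range(len(origins)))
    dests = ";".join(str(i) for i in range(len(origins), len(points)))
    return (
        f"{base_url.rstrip('/')}/table/v1/{profile}/{coords}"
        f"?sources={sources}&destinations={dests}&annotations=duration,distance"
    )


url = osrm_table_url(
    origins=[(-77.03, -12.05)],
    destinations=[(-77.01, -12.11), (-76.99, -12.08)],
    base_url="http://localhost:5001",  # custom port chosen by the caller
)
```

Keeping the default at port 5000 would preserve current behavior while letting users point at a server on another port or host.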

a-regal and others added 5 commits May 20, 2026 17:52

a-regal requested a review from Claudio9701 May 23, 2026 14:20

Claudio9701 approved these changes May 23, 2026

Claudio9701 merged commit ddd0360 into master May 23, 2026
12 of 19 checks passed

Claudio9701 merged 5 commits into master from install-recipes
