GeoTessera Performance Report: Slow Registry Initialization
Summary
GeoTessera() initialization takes 14-22 seconds due to inefficient row-by-row iteration when building the tile lookup index. This can be reduced to ~2 seconds with vectorized operations.
Environment
- geotessera version: (installed via pip)
- Python: 3.13
- Platform: macOS (Darwin 24.3.0, Apple Silicon)
- Registry size: 2,258,858 tiles (179 MB parquet file)
Profiling Results
$ python -m cProfile -s cumulative
110,542,680 function calls in 26.316 seconds
ncalls tottime cumtime filename:lineno(function)
1 0.000 26.316 registry.py:592(__init__)
1 2.828 17.316 registry.py:699(_load_registry)
7704674 4.050 16.340 registry.py:234(coord_to_grid_int) <-- BOTTLENECK
7704674 1.303 11.952 numpy/fromnumeric.py:3618(round)
1 1.382 9.000 registry.py:816(_load_landmasks_registry)
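Key finding: coord_to_grid_int() is called 7.7 million times, roughly 3.4 calls per registry row, with a cumulative cost of 16+ seconds. The tile-index loop accounts for two of those calls per row (lon and lat; year is cast with int() directly), so the remainder presumably comes from other call sites such as _load_landmasks_registry.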
Root Cause
In registry.py lines 809-814, the tile index is built using row-by-row iteration:
# Current implementation (SLOW)
self._tile_index: Dict[Tuple[int, int, int], int] = {}
for idx, row in enumerate(self._registry_gdf.itertuples()):
    lon_i = int(coord_to_grid_int(row.lon))  # np.round() called per row
    lat_i = int(coord_to_grid_int(row.lat))  # np.round() called per row
    key = (int(row.year), lon_i, lat_i)
    self._tile_index[key] = idx

This pattern is slow because:
- .itertuples() has Python loop overhead for 2.2M rows
- coord_to_grid_int() calls np.round() on scalar values (no vectorization)
- Dictionary construction happens one key at a time
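To make the per-call overhead concrete, the sketch below times a scalar np.round() loop against a single vectorized call on synthetic coordinates. The coord_to_grid_int_scalar function is a hypothetical stand-in that assumes the conversion is round(coord * 100), as the suggested fix implies; the real function in registry.py may differ.

```python
import time

import numpy as np


def coord_to_grid_int_scalar(coord: float) -> int:
    # Hypothetical stand-in for geotessera's coord_to_grid_int (assumed: round(coord * 100))
    return int(np.round(coord * 100))


rng = np.random.default_rng(0)
# Synthetic longitudes on a 0.1-degree grid, offset by 0.05
lons = rng.integers(-1800, 1800, size=200_000) / 10 + 0.05

# Scalar path: one np.round() call per value, as in the current loop
t0 = time.perf_counter()
scalar = [coord_to_grid_int_scalar(x) for x in lons]
t_scalar = time.perf_counter() - t0

# Vectorized path: one np.round() call for the whole array
t0 = time.perf_counter()
vectorized = np.round(lons * 100).astype(np.int64)
t_vec = time.perf_counter() - t0

# Both paths must produce identical grid integers
assert scalar == vectorized.tolist()
print(f"scalar: {t_scalar:.3f}s, vectorized: {t_vec:.4f}s")
```

Absolute timings depend on the machine, so only the ratio between the two numbers is meaningful.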
Suggested Fix
Replace with vectorized pandas/numpy operations:
# Vectorized implementation (FAST)
def _build_tile_index(self):
    """Build O(1) lookup index using vectorized operations."""
    df = self._registry_gdf
    # Vectorized coordinate conversion (entire column at once)
    lon_i = (df['lon'] * 100).round().astype('int32')
    lat_i = (df['lat'] * 100).round().astype('int32')
    years = df['year'].astype('int32')
    # Build dictionary from arrays
    self._tile_index = {
        (y, lo, la): idx
        for idx, (y, lo, la) in enumerate(zip(years, lon_i, lat_i))
    }

Or even faster, building the dictionary directly with dict(zip(...)):
# Alternative: build the keys once, then pair them with row positions
df = self._registry_gdf
lon_i = (df['lon'] * 100).round().astype('int32')
lat_i = (df['lat'] * 100).round().astype('int32')
# Create tuple keys as a list, then convert to dict
keys = list(zip(df['year'].astype('int32'), lon_i, lat_i))
self._tile_index = dict(zip(keys, range(len(keys))))

Performance Comparison
| Implementation | Time | Speedup |
|---|---|---|
| Current (itertuples + scalar np.round) | ~14s | 1x |
| Vectorized pandas operations | ~1-2s | 7-14x |
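Before swapping implementations, it is worth confirming that both produce the same index. The sketch below builds the dictionary both ways on a synthetic DataFrame and compares the results. The function names are hypothetical, and the loop version inlines round(coord * 100) as an assumed equivalent of coord_to_grid_int().

```python
import numpy as np
import pandas as pd


def build_index_loop(df: pd.DataFrame) -> dict:
    # Row-by-row construction, mirroring the current implementation
    index = {}
    for idx, row in enumerate(df.itertuples()):
        lon_i = int(np.round(row.lon * 100))  # assumed body of coord_to_grid_int
        lat_i = int(np.round(row.lat * 100))
        index[(int(row.year), lon_i, lat_i)] = idx
    return index


def build_index_vectorized(df: pd.DataFrame) -> dict:
    # Column-at-a-time conversion; .tolist() yields plain Python ints
    lon_i = (df["lon"] * 100).round().astype("int32").tolist()
    lat_i = (df["lat"] * 100).round().astype("int32").tolist()
    years = df["year"].astype("int32").tolist()
    return dict(zip(zip(years, lon_i, lat_i), range(len(df))))


rng = np.random.default_rng(42)
n = 50_000
df = pd.DataFrame(
    {
        "year": rng.integers(2017, 2025, size=n),
        "lon": rng.integers(-1800, 1800, size=n) / 10 + 0.05,
        "lat": rng.integers(-900, 900, size=n) / 10 + 0.05,
    }
)

loop_index = build_index_loop(df)
vec_index = build_index_vectorized(df)

# Same keys and same row positions; duplicate keys keep the last row in both
assert loop_index == vec_index
```

Using .tolist() keeps the dictionary keys as built-in int values, so lookups with ordinary Python integers hash and compare the same way as in the loop version.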
Similar Issue in _load_landmasks_registry
The same pattern exists around lines 890-895:
for idx, row in enumerate(self._landmasks_df.itertuples()):
    lon_i = int(coord_to_grid_int(row.lon))
    lat_i = int(coord_to_grid_int(row.lat))
    key = (lon_i, lat_i)
    self._landmasks_index[key] = idx

This should also be vectorized for consistency.
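A vectorized equivalent for the landmasks index could look like the sketch below. It mirrors the tile-index fix, again assuming coord_to_grid_int() is equivalent to round(coord * 100). The function name build_landmasks_index is hypothetical, and the two-row DataFrame is illustrative data only.

```python
import pandas as pd


def build_landmasks_index(landmasks_df: pd.DataFrame) -> dict:
    """Map (lon_i, lat_i) grid integers to row positions."""
    # Convert whole columns at once instead of calling np.round() per row
    lon_i = (landmasks_df["lon"] * 100).round().astype("int32").tolist()
    lat_i = (landmasks_df["lat"] * 100).round().astype("int32").tolist()
    return dict(zip(zip(lon_i, lat_i), range(len(landmasks_df))))


# Illustrative data only
landmasks_df = pd.DataFrame({"lon": [0.05, -0.15], "lat": [52.25, 52.25]})
landmasks_index = build_landmasks_index(landmasks_df)
print(landmasks_index)  # {(5, 5225): 0, (-15, 5225): 1}
```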
Benchmark Script
import time
from pathlib import Path

from geotessera import GeoTessera

# Time the current implementation's initialization
start = time.perf_counter()
gt = GeoTessera(
    embeddings_dir='/path/to/embeddings',
    registry_dir=Path.home() / '.cache' / 'geotessera'
)
print(f'GeoTessera init: {time.perf_counter() - start:.1f}s')

Impact
This initialization delay affects:
- CLI tools that create new GeoTessera instances
- Web applications with per-request initialization
- Batch processing scripts
Recommendation
- Replace row-by-row iteration with vectorized operations in _load_registry() and _load_landmasks_registry()
- Consider lazy-loading the tile index (only build when first needed)
- Consider caching the tile index to disk alongside the parquet file
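The lazy-loading recommendation can be sketched with functools.cached_property, which defers the build until the first lookup and then reuses the result. LazyTileRegistry, find_row, and build_count are hypothetical names for illustration, not part of geotessera's API, and the key conversion again assumes round(coord * 100).

```python
from functools import cached_property

import pandas as pd


class LazyTileRegistry:
    """Hypothetical stand-in for the registry class (not geotessera's API)."""

    def __init__(self, registry_df: pd.DataFrame):
        self._registry_df = registry_df
        self.build_count = 0  # only here to show when the index is built

    @cached_property
    def tile_index(self) -> dict:
        # Runs on first access only; the result is then stored on the instance
        self.build_count += 1
        df = self._registry_df
        lon_i = (df["lon"] * 100).round().astype("int32").tolist()
        lat_i = (df["lat"] * 100).round().astype("int32").tolist()
        years = df["year"].astype("int32").tolist()
        return dict(zip(zip(years, lon_i, lat_i), range(len(df))))

    def find_row(self, year: int, lon: float, lat: float):
        # Assumed key convention: (year, round(lon * 100), round(lat * 100))
        return self.tile_index.get((year, round(lon * 100), round(lat * 100)))


# Illustrative data only
registry = LazyTileRegistry(
    pd.DataFrame({"year": [2024, 2024], "lon": [0.05, -0.15], "lat": [52.25, 52.25]})
)
builds_after_init = registry.build_count        # 0: nothing built yet
first = registry.find_row(2024, -0.15, 52.25)   # triggers the build
second = registry.find_row(2024, 0.05, 52.25)   # reuses the cached index
```

With this layout, construction costs nothing for code paths that never look up a tile, while repeated lookups pay for the build only once.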
Report generated: 2026-02-06
Tested with registry containing 2,258,858 tiles