Performance issue in registry.py lines 809-814 #175

@sk818

GeoTessera Performance Report: Slow Registry Initialization

Summary

GeoTessera() initialization takes 14-22 seconds (about 26 seconds under cProfile) because the tile lookup index is built with row-by-row iteration. Vectorized operations should reduce this to roughly 2 seconds.

Environment

  • geotessera version: (installed via pip)
  • Python: 3.13
  • Platform: macOS (Darwin 24.3.0, Apple Silicon)
  • Registry size: 2,258,858 tiles (179 MB parquet file)

Profiling Results

$ python -m cProfile -s cumulative

110,542,680 function calls in 26.316 seconds

   ncalls  tottime  cumtime  filename:lineno(function)
        1    0.000   26.316  registry.py:592(__init__)
        1    2.828   17.316  registry.py:699(_load_registry)
  7704674    4.050   16.340  registry.py:234(coord_to_grid_int)  <-- BOTTLENECK
  7704674    1.303   11.952  numpy/fromnumeric.py:3618(round)
        1    1.382    9.000  registry.py:816(_load_landmasks_registry)

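Key finding: coord_to_grid_int() is called 7.7 million times, about 3.4 calls per registry row, and accounts for 16+ seconds of cumulative time. Only lon and lat pass through it (year is cast with int()), so the tile-index loop contributes two calls per row; the remaining calls presumably come from the equivalent loop in _load_landmasks_registry.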
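The call counts can be broken down with a few lines of arithmetic. This is only a sketch built on the figures quoted in this report; it assumes the loop shown under Root Cause is the only caller of coord_to_grid_int() inside _load_registry, which the profile excerpt does not confirm.

```python
# Figures taken from the profile output and the Environment section.
total_calls = 7_704_674      # ncalls reported for coord_to_grid_int
registry_rows = 2_258_858    # tiles in the registry

calls_per_row = total_calls / registry_rows

# Assumption: the tile-index loop calls the helper once for lon and once for lat.
tile_loop_calls = 2 * registry_rows
other_calls = total_calls - tile_loop_calls

print(f"{calls_per_row:.2f} calls per registry row")
print(f"{tile_loop_calls:,} calls attributable to the tile-index loop")
print(f"{other_calls:,} calls from other call sites")
```

The calls left over after subtracting the tile-index loop would be consistent with a second loop of the same shape, such as the one in _load_landmasks_registry.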

Root Cause

In registry.py lines 809-814, the tile index is built using row-by-row iteration:

# Current implementation (SLOW)
self._tile_index: Dict[Tuple[int, int, int], int] = {}
for idx, row in enumerate(self._registry_gdf.itertuples()):
    lon_i = int(coord_to_grid_int(row.lon))   # np.round() called per row
    lat_i = int(coord_to_grid_int(row.lat))   # np.round() called per row
    key = (int(row.year), lon_i, lat_i)
    self._tile_index[key] = idx

This pattern is slow because:

  1. .itertuples() incurs Python loop overhead for each of the ~2.26M rows
  2. coord_to_grid_int() calls np.round() on scalar values (no vectorization)
  3. Dictionary construction happens one key at a time
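The cost of the scalar path can be reproduced in isolation. The following is a minimal sketch on synthetic data, not the geotessera code: the coord_to_grid_int defined here is a hypothetical stand-in that assumes the real helper rounds the coordinate multiplied by 100, and the array size is arbitrary.

```python
import time

import numpy as np


def coord_to_grid_int(coord):
    # Hypothetical stand-in: the real helper in registry.py is not shown here.
    return np.round(coord * 100)


rng = np.random.default_rng(0)
lons = rng.uniform(-180, 180, size=200_000)

# Scalar path: one np.round() call per value, as in the row-by-row loop.
t0 = time.perf_counter()
scalar = [int(coord_to_grid_int(x)) for x in lons]
scalar_s = time.perf_counter() - t0

# Vectorized path: a single np.round() call over the whole array.
t0 = time.perf_counter()
vector = np.round(lons * 100).astype(np.int32)
vector_s = time.perf_counter() - t0

print(f"scalar: {scalar_s:.3f}s  vectorized: {vector_s:.3f}s")
print("identical results:", scalar == vector.tolist())
```

Both paths perform the same floating-point operations, so the results should match exactly; only the per-call overhead differs.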

Suggested Fix

Replace with vectorized pandas/numpy operations:

# Vectorized implementation (FAST)
def _build_tile_index(self):
    """Build O(1) lookup index using vectorized operations."""
    df = self._registry_gdf

    # Vectorized coordinate conversion (entire column at once).
    # Assumes coord_to_grid_int(x) is equivalent to round(x * 100).
    lon_i = (df['lon'] * 100).round().astype('int32')
    lat_i = (df['lat'] * 100).round().astype('int32')
    years = df['year'].astype('int32')

    # Build dictionary from the converted columns; enumerate() supplies the
    # positional row index, matching the original loop
    self._tile_index = {
        (y, lo, la): idx
        for idx, (y, lo, la) in enumerate(zip(years, lon_i, lat_i))
    }

A more compact alternative builds the keys with zip() and passes them straight to dict():

# Alternative: build the key list first, then the dict
df = self._registry_gdf
lon_i = (df['lon'] * 100).round().astype('int32')
lat_i = (df['lat'] * 100).round().astype('int32')

# Create (year, lon, lat) tuple keys as a list, then map each key to its row position
keys = list(zip(df['year'].astype('int32'), lon_i, lat_i))
self._tile_index = dict(zip(keys, range(len(keys))))
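Before swapping implementations, it is worth checking that a vectorized build yields exactly the same dictionary as the loop. The sketch below does this on synthetic data. The function names and the coord_to_grid_int stand-in are hypothetical, and the stand-in assumes the real helper computes round(x * 100).

```python
import numpy as np
import pandas as pd


def coord_to_grid_int(coord):
    # Hypothetical stand-in for the helper in registry.py.
    return np.round(coord * 100)


def build_index_loop(df):
    """Row-by-row build, mirroring the current implementation."""
    index = {}
    for idx, row in enumerate(df.itertuples()):
        lon_i = int(coord_to_grid_int(row.lon))
        lat_i = int(coord_to_grid_int(row.lat))
        index[(int(row.year), lon_i, lat_i)] = idx
    return index


def build_index_vectorized(df):
    """Column-at-a-time build; tolist() yields plain Python ints."""
    lon_i = (df["lon"] * 100).round().astype("int32").tolist()
    lat_i = (df["lat"] * 100).round().astype("int32").tolist()
    years = df["year"].astype("int32").tolist()
    return dict(zip(zip(years, lon_i, lat_i), range(len(df))))


# Synthetic registry with coordinates on a 0.01-degree grid.
rng = np.random.default_rng(42)
n = 10_000
sample = pd.DataFrame({
    "year": rng.integers(2017, 2025, size=n),
    "lon": np.round(rng.uniform(-180, 180, size=n), 2),
    "lat": np.round(rng.uniform(-90, 90, size=n), 2),
})

loop_index = build_index_loop(sample)
vectorized_index = build_index_vectorized(sample)
print("indexes match:", loop_index == vectorized_index)
```

If duplicate keys occur, both builds keep the last row for that key, so the comparison also covers that case.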

Performance Comparison

Implementation                          Time    Speedup
Current (itertuples + scalar np.round)  ~14s    1x
Vectorized pandas operations            ~1-2s   7-14x

Similar Issue in _load_landmasks_registry

The same pattern exists around lines 890-895:

for idx, row in enumerate(self._landmasks_df.itertuples()):
    lon_i = int(coord_to_grid_int(row.lon))
    lat_i = int(coord_to_grid_int(row.lat))
    key = (lon_i, lat_i)
    self._landmasks_index[key] = idx

This should also be vectorized: the profile above attributes 9.0 seconds of cumulative time to _load_landmasks_registry.
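A vectorized version of the landmasks loop could look like the sketch below. The function name build_landmasks_index is hypothetical, the demo frame is made up, and the conversion again assumes coord_to_grid_int(x) is equivalent to round(x * 100).

```python
import pandas as pd


def build_landmasks_index(landmasks_df: pd.DataFrame) -> dict:
    """Map (lon_i, lat_i) grid keys to positional row indexes."""
    # Assumed equivalent of coord_to_grid_int(), applied to whole columns.
    lon_i = (landmasks_df["lon"] * 100).round().astype("int32").tolist()
    lat_i = (landmasks_df["lat"] * 100).round().astype("int32").tolist()
    return dict(zip(zip(lon_i, lat_i), range(len(lon_i))))


# Tiny made-up frame; the first and last rows share the same coordinates.
demo = pd.DataFrame({
    "lon": [0.15, -1.25, 0.15],
    "lat": [52.05, 10.35, 52.05],
})
index = build_landmasks_index(demo)
print(index)
```

As in the original loop, a repeated coordinate pair keeps the index of the last matching row.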

Benchmark Script

import time
from pathlib import Path

from geotessera import GeoTessera

# Time GeoTessera initialization (perf_counter is a monotonic, high-resolution clock)
start = time.perf_counter()
gt = GeoTessera(
    embeddings_dir='/path/to/embeddings',
    registry_dir=Path.home() / '.cache' / 'geotessera'
)
print(f'GeoTessera init: {time.perf_counter() - start:.1f}s')

Impact

This initialization delay affects:

  • CLI tools that create new GeoTessera instances
  • Web applications with per-request initialization
  • Batch processing scripts

Recommendation

  1. Replace row-by-row iteration with vectorized operations in _load_registry() and _load_landmasks_registry()
  2. Consider lazy-loading the tile index (only build when first needed)
  3. Consider caching the tile index to disk alongside the parquet file
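The lazy-loading idea in recommendation 2 can be sketched with functools.cached_property. LazyRegistry below is a hypothetical stand-in, not geotessera's class, and its index_builds counter exists only to make the deferred build observable. The key layout (year, lon, lat) follows the loop shown earlier, with round(x * 100) assumed as the grid conversion.

```python
from functools import cached_property


class LazyRegistry:
    """Hypothetical stand-in that defers index construction to first use."""

    def __init__(self, rows):
        # rows: iterable of (year, lon, lat) tuples
        self._rows = list(rows)
        self.index_builds = 0  # demo-only counter

    @cached_property
    def tile_index(self):
        # Runs once, on first access; the result is cached on the instance.
        self.index_builds += 1
        return {
            (int(year), round(lon * 100), round(lat * 100)): idx
            for idx, (year, lon, lat) in enumerate(self._rows)
        }

    def find(self, year, lon, lat):
        """Return the row position for a tile, or None if it is absent."""
        return self.tile_index.get((year, round(lon * 100), round(lat * 100)))


registry = LazyRegistry([(2024, 0.15, 52.05), (2024, 0.25, 52.05)])
builds_before = registry.index_builds      # nothing built at construction time
first = registry.find(2024, 0.25, 52.05)   # triggers the one-time build
missing = registry.find(2023, 0.25, 52.05)
builds_after = registry.index_builds
```

With this shape, constructing the object stays cheap, and callers that never look up a tile never pay for the index.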

Report generated: 2026-02-06
Tested with registry containing 2,258,858 tiles
