This issue details the codebase audit, refactoring roadmap, and feature specification for implementing advanced spatial analysis on AIS tracks.
1. Codebase Audit & Refactoring Tasks
We will clean up the existing codebase to eliminate soft fallbacks, silent errors, and defensive programming, replacing them with strict assertions and immediate failure when data does not conform to expectations.
A. data_loader.py
- Current Issues:
- Implements a soft conversion fallback in convert_to_gdf that checks for a "Shape" column.
- Quietly handles cases where the geometry coordinates are missing.
- Automatically calculates spatial partitions on the fly with a warning instead of demanding them.
- Only prints a warning when the CRS is not EPSG:3857.
- Refactoring Steps:
- Demand that the input is a valid GeoParquet file. If the CRS is not EPSG:3857, raise a ValueError immediately.
- Require that spatial partitions are already calculated and stored in metadata. If they are absent, fail immediately.
- Eliminate the convert_to_gdf fallback function from the loader.
B. preprocessing.py
- Current Issues:
- Catches general exceptions when trying to read GeoParquet and silently falls back to WKB conversion.
- Mixes dataset reading, schema transformation, WKB parsing, and reprojection.
- Refactoring Steps:
- Enforce a single canonical input format.
- Remove the inline try-except block that falls back to WKB conversion.
- Create a separate, dedicated converter utility/command in the CLI if WKB conversion is needed, keeping the main preprocessing module clean and focused on spatial partitioning.
C. renderer.py
- Current Issues:
- Catches general exceptions in render_tile_task only to print tracebacks and re-raise them, cluttering worker logs.
- Uses soft checks like if category_column in gdf_local.columns: instead of assuming metadata and schema are correct.
- Soft-assigns names to arrays: if not da.name: da.name = "counts".
- Refactoring Steps:
- Remove defensive try-except blocks from the task handlers; let Dask capture and report failures natively.
- Validate the category_column presence at startup, not per-tile. If it is missing from the schema, fail immediately.
- Strictly enforce that the data variable name is "counts".
D. postprocessing.py
- Current Issues:
- Contains extremely defensive fallback logic to find data variables (e.g., searching for "counts", then "__xarray_dataarray_variable__", and finally defaulting to the first non-spatial variable).
- Catches general exceptions in export_single_cog and returns None (silent/soft failure).
- Catches exceptions during cleanup.
- Refactoring Steps:
- Enforce that the Zarr dataset contains a "counts" data variable. If it does not exist, raise a KeyError immediately.
- Eliminate all try-except blocks that swallow raster writing or configuration errors. Let the program crash if writing a COG or deleting temporary files fails.
2. Reference Data Sources
- Non-Public Data Isolation: Some of the data in ~/data/ais is non-public and confidential. Only public source data should be referred to.
- Commit & Log Compliance: Under no circumstances should any references to non-public data sources, names, or private locations be written in git commits, git logs, or issue descriptions. All tests, examples, and documentation must only refer to public sources or synthetic data.
- Waterway Geometries Source: The canonical, public waterway centerline geometries can be obtained from the EURIS Waterway Network dataset on Zenodo.
- Target Analysis Datasets:
- Cross Sections: PassageLine_NL_20260224.geojson (found under ~/src/fis/output/euris-export/).
- Fairway Edges: edges.geoparquet (found under ~/src/fis/output/euris-graph/ or similar graph folders).
- Lock Chambers: LockChamberArea_NL_20260224.geojson (representing lock chamber polygons in ~/src/fis/output/euris-export/).
3. CRS & Reprojection Strategy
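To ensure distance-based calculations (buffer widths in profiles, segment lengths in centerline snapping) are performed in meter units, the entire pipeline operates in the metric EPSG:3857 projection. Note that EPSG:3857 (Web Mercator) stretches distances relative to the ground by roughly 1/cos(latitude) (about 1.6 at 52° N), so parameters such as segment_length are expressed in projected meters rather than true ground meters. Responsibilities are split as follows: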
- CLI/Ingestion Layer:
- Responsible for loading the query geometries (polygons, cross profiles, waterway centerline).
- Normalizes the input geometries by reprojecting them to EPSG:3857 immediately if they are in another CRS (e.g., EPSG:4326).
- Confirms that the loaded AIS geoparquet is in EPSG:3857.
- Core Calculation Layer:
- Strictly requires both datasets to have the matching CRS EPSG:3857.
- Raises a ValueError immediately if there is a CRS mismatch or if the CRS is not metric (EPSG:3857).
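The strict check in the core calculation layer could look roughly like the sketch below. The function names require_epsg_3857 and require_matching_crs are hypothetical, and the sketch works on plain EPSG codes so it stays dependency-free; with GeoPandas the code would typically be obtained via gdf.crs.to_epsg() (or None when gdf.crs is unset).

```python
from typing import Optional

REQUIRED_EPSG = 3857  # the single metric CRS the pipeline accepts


def require_epsg_3857(epsg: Optional[int], label: str) -> None:
    """Raise ValueError unless `epsg` is exactly 3857 (hypothetical helper)."""
    if epsg != REQUIRED_EPSG:
        found = "no CRS" if epsg is None else f"EPSG:{epsg}"
        raise ValueError(f"{label}: expected EPSG:{REQUIRED_EPSG}, got {found}")


def require_matching_crs(ais_epsg: Optional[int], query_epsg: Optional[int]) -> None:
    """Fail fast if either dataset is not in EPSG:3857.

    Requiring both to equal 3857 also guarantees that they match each other.
    """
    require_epsg_3857(ais_epsg, "AIS points")
    require_epsg_3857(query_epsg, "query geometries")


# Usage: passes silently for conforming inputs, raises for anything else.
require_matching_crs(3857, 3857)
```

Reprojection stays out of this helper on purpose: under the split described above, only the CLI/ingestion layer is allowed to normalize geometries.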
4. Analysis Features & Advanced Algorithms
We will add a new module src/ais_shader/analysis.py containing three specialized algorithms:
A. Passage Line Crossing Analysis
- Goal: Compute crossing speed histograms for vessels crossing cross-section lines.
- Algorithm:
- Sort AIS points per vessel (pseudo_id / MMSI) chronologically.
- For consecutive points $P_t$ and $P_{t+1}$, construct the segment $S_t = \text{LineString}([P_t, P_{t+1}])$.
- Perform a spatial join (intersects) between these segments and the cross-section lines.
- For intersecting segments:
- Project the intersection point $I$ onto the segment to calculate its fractional distance $f$.
- Linearly interpolate the speed at the crossing: $sog_{cross} = sog_t + f \times (sog_{t+1} - sog_t)$.
- Bin the crossing speeds and append the histogram columns to the original cross-section GeoDataFrame.
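The interpolation step for a single track segment and a single two-point cross-section line can be sketched without any dependencies, as below. This is only an illustration of the arithmetic: the function names are hypothetical, the cross-section is assumed to be a straight two-point line, and the real module would presumably do the candidate search with a vectorized spatial join as described above.

```python
from typing import Optional, Tuple

Point = Tuple[float, float]


def crossing_fraction(p0: Point, p1: Point, a: Point, b: Point) -> Optional[float]:
    """Fraction f in [0, 1] along p0 -> p1 where it crosses the line a -> b.

    Returns None when the segments do not intersect. Parallel and collinear
    segments are treated as "no crossing" in this sketch.
    """
    rx, ry = p1[0] - p0[0], p1[1] - p0[1]  # track segment direction
    sx, sy = b[0] - a[0], b[1] - a[1]      # cross-section direction
    denom = rx * sy - ry * sx              # 2D cross product r x s
    if denom == 0:
        return None
    qx, qy = a[0] - p0[0], a[1] - p0[1]
    f = (qx * sy - qy * sx) / denom        # position along the track segment
    u = (qx * ry - qy * rx) / denom        # position along the cross-section
    if 0.0 <= f <= 1.0 and 0.0 <= u <= 1.0:
        return f
    return None


def interpolate_sog(sog_t: float, sog_t1: float, f: float) -> float:
    """Linear interpolation: sog_cross = sog_t + f * (sog_t1 - sog_t)."""
    return sog_t + f * (sog_t1 - sog_t)


# Usage: a vessel moving along the x-axis crosses a vertical line at x = 4.
f = crossing_fraction((0.0, 0.0), (10.0, 0.0), (4.0, -5.0), (4.0, 5.0))
if f is not None:
    sog_cross = interpolate_sog(5.0, 10.0, f)
```

Because f is measured in projected coordinates, it assumes the vessel moved at a steady rate between the two AIS fixes, which is the same assumption the linear speed interpolation makes.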
B. Fairway Snapping & Binning
- Goal: Project AIS points onto nearby fairway edge linestrings and bin speeds along their lengths.
- Algorithm:
- Find the nearest edge for each AIS point using a spatial index (gpd.sjoin_nearest within a reasonable search radius).
- For each point snapped to an edge, project it onto that edge using vectorized line_locate_point to get the distance along the edge.
- Divide each edge into segments of segment_length (e.g., 100 meters).
- Assign points to their corresponding edge segment, group by segment, and compute speed histograms.
- Return a GeoDataFrame containing the segmented fairway lines with the speed histograms attached.
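For one edge, the projection and binning steps reduce to the arithmetic sketched below. This is a dependency-free illustration, not the proposed implementation: locate_along is a hypothetical stand-in for what line_locate_point computes, bin_speeds_along_edge is a hypothetical name, the nearest-edge assignment is assumed to have happened already, and the speed bin edges are made up for the example.

```python
import math
from bisect import bisect_right
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float]


def locate_along(line: Sequence[Point], pt: Point) -> float:
    """Distance along polyline `line` of the point on it closest to `pt`."""
    best_d2, best_along, walked = math.inf, 0.0, 0.0
    for (x0, y0), (x1, y1) in zip(line, line[1:]):
        dx, dy = x1 - x0, y1 - y0
        seg_len = math.hypot(dx, dy)
        if seg_len == 0:
            continue  # skip degenerate (zero-length) pieces
        # Parameter of the orthogonal projection, clamped to the piece.
        t = ((pt[0] - x0) * dx + (pt[1] - y0) * dy) / (seg_len * seg_len)
        t = min(1.0, max(0.0, t))
        d2 = (pt[0] - (x0 + t * dx)) ** 2 + (pt[1] - (y0 + t * dy)) ** 2
        if d2 < best_d2:
            best_d2, best_along = d2, walked + t * seg_len
        walked += seg_len
    return best_along


def bin_speeds_along_edge(
    line: Sequence[Point],
    points: Sequence[Tuple[float, float, float]],  # (x, y, sog)
    segment_length: float,
    speed_edges: Sequence[float],
) -> Dict[int, List[int]]:
    """Return {segment index: counts per half-open speed bin [lo, hi)}."""
    total = sum(math.hypot(x1 - x0, y1 - y0) for (x0, y0), (x1, y1) in zip(line, line[1:]))
    n_segments = max(1, math.ceil(total / segment_length))
    hist: Dict[int, List[int]] = {}
    for x, y, sog in points:
        along = locate_along(line, (x, y))
        # Clamp so a point at the very end of the edge lands in the last segment.
        seg = min(int(along // segment_length), n_segments - 1)
        b = bisect_right(speed_edges, sog) - 1
        if 0 <= b < len(speed_edges) - 1:  # speeds outside the edges are dropped
            hist.setdefault(seg, [0] * (len(speed_edges) - 1))[b] += 1
    return hist


# Usage: a 300 m straight edge cut into 100 m segments.
edge = [(0.0, 0.0), (300.0, 0.0)]
pts = [(50.0, 10.0, 3.0), (150.0, -5.0, 7.0), (160.0, 2.0, 8.0), (300.0, 0.0, 12.0)]
result = bin_speeds_along_edge(edge, pts, 100.0, [0, 5, 10, 15])
```

Dropping speeds that fall outside the bin edges is a choice made for this sketch; the specification does not say how out-of-range values should be handled.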
C. Lock Chamber Visit Duration
- Goal: Compute visit durations and duration histograms for lock chamber polygons.
- Algorithm:
- Spatial join (within) to find all AIS points falling inside each lock chamber.
- Group points by lock chamber ID and vessel ID (pseudo_id / MMSI), sorted chronologically.
- Partition the points into distinct visits: if the time gap between consecutive points of a vessel in the same lock exceeds a threshold (e.g., 2 hours), split them into separate visits.
- For each visit, compute the duration: $duration = t_{exit} - t_{entry}$.
- Bin these durations (e.g., in minutes: [0, 15, 30, 45, 60, 90, 120, 180]) and append the count columns to the lock chamber GeoDataFrame.
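The visit-splitting and binning steps for one (lock chamber, vessel) group can be sketched as follows. The names split_visits and duration_histogram are hypothetical, the spatial join and grouping are assumed to have been done already, and the half-open bins [lo, hi) with durations of 180 minutes or more dropped are assumptions of this sketch rather than part of the specification.

```python
from bisect import bisect_right
from datetime import datetime, timedelta
from typing import Iterable, List, Sequence, Tuple

Visit = Tuple[datetime, datetime]  # (entry, exit)


def split_visits(timestamps: Iterable[datetime],
                 max_gap: timedelta = timedelta(hours=2)) -> List[Visit]:
    """Split one vessel's timestamps inside one lock into separate visits.

    A new visit starts whenever the gap to the previous point exceeds max_gap.
    """
    ts = sorted(timestamps)
    if not ts:
        return []
    visits: List[Visit] = []
    entry = prev = ts[0]
    for t in ts[1:]:
        if t - prev > max_gap:
            visits.append((entry, prev))  # close the current visit
            entry = t                     # and open a new one
        prev = t
    visits.append((entry, prev))
    return visits


def duration_histogram(visits: Sequence[Visit],
                       edges_min: Sequence[float] = (0, 15, 30, 45, 60, 90, 120, 180)
                       ) -> List[int]:
    """Count visits per duration bin, with duration = t_exit - t_entry in minutes."""
    counts = [0] * (len(edges_min) - 1)
    for entry, exit_ in visits:
        minutes = (exit_ - entry).total_seconds() / 60
        i = bisect_right(edges_min, minutes) - 1
        if 0 <= i < len(counts):
            counts[i] += 1
    return counts


# Usage: two visits on the same day, separated by a gap of more than 2 hours.
day = datetime(2024, 1, 1)
stamps = [day.replace(hour=8, minute=m) for m in (0, 10, 20)]
stamps += [day.replace(hour=14, minute=m) for m in (0, 40)]
visits = split_visits(stamps)
counts = duration_histogram(visits)
```

A visit consisting of a single AIS point gets a duration of zero under this definition, so it lands in the first bin.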
5. CLI commands
We will add a new CLI command group under ais-shader:
- ais-shader analyze-passage: Compute passage line crossings.
- ais-shader analyze-fairway: Project and bin speeds along fairway edges.
- ais-shader analyze-lock: Compute visit durations inside lock chamber polygons.
All commands will export output files as GeoJSON/GeoPackage for direct visualization in GIS tools.
6. Test Cases & Validation
We will implement the following tests in tests/test_analysis.py:
- test_passage_crossing: Create a straight trajectory intersecting a cross-section line. Verify that the crossing speed is interpolated correctly.
- test_fairway_snap: Snap random points to parallel line segments, and verify that projection distances along the lines are computed accurately.
- test_lock_duration: Create two separate visits of a vessel to a lock chamber separated by a large time gap. Verify they are split into two visits and durations are calculated correctly.