Refactoring Roadmap and AIS Speed Histograms Implementation Plan #2

@SiggyF

This issue details the codebase audit, refactoring roadmap, and feature specification for implementing advanced spatial analysis on AIS tracks.

1. Codebase Audit & Refactoring Tasks

We will clean up the existing codebase to eliminate soft fallbacks, silent errors, and defensive programming, replacing them with strict assertions and immediate failure when data does not conform to expectations.

A. data_loader.py

  • Current Issues:
    • Implements soft conversion fallback in convert_to_gdf checking for a "Shape" column.
    • Quietly handles cases where the geometry coordinates are missing.
    • Automatically calculates spatial partitions on the fly with a warning instead of demanding them.
    • Only prints a warning when the CRS is not EPSG:3857.
  • Refactoring Steps:
    • Demand that the input is a valid GeoParquet file. If CRS is not EPSG:3857, raise a ValueError immediately.
    • Require that spatial partitions are already calculated and stored in metadata. If they are absent, fail immediately.
    • Eliminate the convert_to_gdf fallback function from the loader.
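
A minimal sketch of the fail-fast checks listed above, written against plain values rather than the real loader. The function name validate_loader_input and the "spatial_partitions" metadata key are hypothetical, since the issue does not name them, and using KeyError for missing partitions is likewise an assumption.

```python
REQUIRED_EPSG = 3857
# Hypothetical metadata key: the issue does not say where partitions are stored.
PARTITIONS_KEY = "spatial_partitions"


def validate_loader_input(epsg, metadata):
    """Fail immediately instead of warning or recomputing on the fly."""
    if epsg != REQUIRED_EPSG:
        raise ValueError(f"Expected CRS EPSG:{REQUIRED_EPSG}, got EPSG:{epsg}")
    if PARTITIONS_KEY not in metadata:
        # Assumed exception type; the issue only says "fail immediately".
        raise KeyError(f"Missing required metadata entry {PARTITIONS_KEY!r}")
    return metadata[PARTITIONS_KEY]
```

With this shape the loader has a single success path: the caller either receives the stored partitions or gets an exception, and nothing is recomputed behind a warning.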

B. preprocessing.py

  • Current Issues:
    • Catches general exceptions when trying to read GeoParquet and silently falls back to WKB conversion.
    • Mixes dataset reading, schema transformation, WKB parsing, and reprojection.
  • Refactoring Steps:
    • Enforce a single canonical input format.
    • Remove the inline try-except block that falls back to WKB conversion.
    • Create a separate, dedicated converter utility/command in the CLI if WKB conversion is needed, keeping the main preprocessing module clean and focused on spatial partitioning.

C. renderer.py

  • Current Issues:
    • Catches general exceptions in render_tile_task only to print the traceback and re-raise, cluttering worker logs.
    • Uses soft checks such as if category_column in gdf_local.columns: instead of assuming metadata and schema are correct.
    • Soft-assigns names to arrays: if not da.name: da.name = "counts".
  • Refactoring Steps:
    • Remove defensive try-except blocks from the task handlers; let Dask capture and report failures natively.
    • Validate the category_column presence at startup, not per-tile. If it is missing from the schema, fail immediately.
    • Strictly enforce that the data variable name is "counts".

D. postprocessing.py

  • Current Issues:
    • Contains extremely defensive fallback logic to find data variables (e.g., searching for "counts", then "__xarray_dataarray_variable__", and finally defaulting to the first non-spatial variable).
    • Catches general exceptions in export_single_cog and returns None (silent/soft failure).
    • Catches exceptions during cleanup.
  • Refactoring Steps:
    • Enforce that the Zarr dataset contains a "counts" data variable. If it does not exist, raise a KeyError immediately.
    • Eliminate all try-except blocks that swallow raster writing or configuration errors. Let the program crash if writing a COG or deleting temporary files fails.
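
As an illustration of the stricter lookup, the sketch below uses a plain mapping as a stand-in for the Zarr-backed dataset. The names select_counts, export_all and write_cog are hypothetical and are not taken from postprocessing.py.

```python
def select_counts(dataset):
    # No search through alternative names and no default: plain indexing
    # raises KeyError when "counts" is absent.
    return dataset["counts"]


def export_all(datasets, write_cog):
    # write_cog is a hypothetical callable. Any exception it raises propagates
    # to the caller instead of being converted into a None result.
    return [write_cog(select_counts(ds)) for ds in datasets]
```

Because nothing is caught, a failed write stops the run at the first bad dataset, which is the behaviour the refactoring steps ask for.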

2. Reference Data Sources

  • Non-Public Data Isolation: Some of the data in ~/data/ais is non-public and confidential. Only public source data should be referred to.
  • Commit & Log Compliance: Under no circumstances should any references to non-public data sources, names, or private locations be written in git commits, git logs, or issue descriptions. All tests, examples, and documentation must only refer to public sources or synthetic data.
  • Waterway Geometries Source: The canonical, public waterway centerline geometries can be obtained from the EURIS waterway network on Zenodo: EURIS Waterway Network (Zenodo).
  • Target Analysis Datasets:
    • Cross Sections: PassageLine_NL_20260224.geojson (found under ~/src/fis/output/euris-export/).
    • Fairway Edges: edges.geoparquet (found under ~/src/fis/output/euris-graph/ or similar graph folders).
    • Lock Chambers: LockChamberArea_NL_20260224.geojson (representing lock chamber polygons in ~/src/fis/output/euris-export/).

3. CRS & Reprojection Strategy

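To keep distance-based calculations (buffer widths in profiles, segment lengths in centerline snapping) in meter units, the entire pipeline operates in the EPSG:3857 (Web Mercator) projection. Note that Web Mercator stretches distances by a factor of about 1/cos(latitude), roughly 1.6 at Dutch latitudes, so these lengths are projected meters rather than true ground meters. Responsibilities are split as follows: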

  • CLI/Ingestion Layer:
    • Responsible for loading the query geometries (polygons, cross profiles, waterway centerline).
    • Normalizes the input geometries by reprojecting them to EPSG:3857 immediately if they are in another CRS (e.g., EPSG:4326).
    • Confirms that the loaded AIS GeoParquet is in EPSG:3857.
  • Core Calculation Layer:
    • Strictly requires both datasets to have the matching CRS EPSG:3857.
    • Raises a ValueError immediately if there is a CRS mismatch or if the CRS is not metric (EPSG:3857).
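
The split between the two layers can be sketched with the standard spherical Web Mercator forward formulas. This is only an illustration: real ingestion code would more likely rely on a library reprojection call, and the function names here are hypothetical.

```python
import math

WEB_MERCATOR_EPSG = 3857
EARTH_RADIUS_M = 6378137.0  # sphere radius used by EPSG:3857


def lonlat_to_web_mercator(lon_deg, lat_deg):
    """Ingestion layer: forward projection from EPSG:4326 degrees to EPSG:3857."""
    x = EARTH_RADIUS_M * math.radians(lon_deg)
    y = EARTH_RADIUS_M * math.log(math.tan(math.pi / 4 + math.radians(lat_deg) / 2))
    return x, y


def require_web_mercator(ais_epsg, query_epsg):
    """Core layer: refuse to compute unless both inputs are EPSG:3857."""
    for name, epsg in (("AIS", ais_epsg), ("query", query_epsg)):
        if epsg != WEB_MERCATOR_EPSG:
            raise ValueError(
                f"{name} data must be in EPSG:{WEB_MERCATOR_EPSG}, got EPSG:{epsg}"
            )
```

Only the ingestion side converts anything. The core check never reprojects, so a mismatch surfaces as an error rather than as silently wrong distances.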

4. Analysis Features & Advanced Algorithms

We will add a new module src/ais_shader/analysis.py containing three specialized algorithms:

A. Passage Line Crossing Analysis

  • Goal: Compute crossing speed histograms for vessels crossing cross-section lines.
  • Algorithm:
    1. Sort AIS points per vessel (pseudo_id / MMSI) chronologically.
    2. For consecutive points $P_t$ and $P_{t+1}$, construct the segment $S_t = \text{LineString}([P_t, P_{t+1}])$.
    3. Perform a spatial join (intersects) between these segments and the cross-section lines.
    4. For intersecting segments:
      • Project the intersection point $I$ onto the segment to obtain its fractional distance from $P_t$: $f = |P_t I| / |P_t P_{t+1}|$, with $0 \le f \le 1$.
      • Linearly interpolate the speed at the crossing: $\text{sog}_{\text{cross}} = \text{sog}_t + f \times (\text{sog}_{t+1} - \text{sog}_t)$.
    5. Bin the crossing speeds and append the histogram columns to the original cross-section GeoDataFrame.
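
The steps above can be sketched for a single vessel and a single straight cross-section line using plain tuples. This is a simplified illustration, not the planned analysis.py code: it replaces the spatial join with a direct segment intersection test, and all function names are hypothetical.

```python
def crossing_fraction(p0, p1, q0, q1):
    """Fraction f in [0, 1] along p0->p1 where it crosses q0->q1, else None."""
    rx, ry = p1[0] - p0[0], p1[1] - p0[1]
    sx, sy = q1[0] - q0[0], q1[1] - q0[1]
    denom = rx * sy - ry * sx
    if denom == 0:
        return None  # parallel or collinear: treated as no crossing here
    dx, dy = q0[0] - p0[0], q0[1] - p0[1]
    f = (dx * sy - dy * sx) / denom  # position along the track segment
    u = (dx * ry - dy * rx) / denom  # position along the cross-section line
    return f if 0 <= f <= 1 and 0 <= u <= 1 else None


def crossing_speeds(track, line):
    """track: (time, x, y, sog) tuples of one vessel; line: ((x, y), (x, y))."""
    points = sorted(track)  # chronological order, time is the first field
    speeds = []
    for (_, x0, y0, s0), (_, x1, y1, s1) in zip(points, points[1:]):
        f = crossing_fraction((x0, y0), (x1, y1), *line)
        if f is not None:
            speeds.append(s0 + f * (s1 - s0))  # linear interpolation of sog
    return speeds


def histogram(values, edges):
    """Count values per half-open bin [edges[i], edges[i + 1])."""
    counts = [0] * (len(edges) - 1)
    for value in values:
        for i in range(len(counts)):
            if edges[i] <= value < edges[i + 1]:
                counts[i] += 1
                break
    return counts
```

For example, a track going from (0, 0) at 4 knots to (10, 0) at 8 knots crosses a line at x = 2.5 with f = 0.25, giving an interpolated speed of 5 knots. A crossing that falls exactly on a shared track vertex would be counted for both adjacent segments in this sketch, so a real implementation needs a tie-breaking rule.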

B. Fairway Snapping & Binning

  • Goal: Project AIS points onto nearby fairway edge linestrings and bin speeds along their lengths.
  • Algorithm:
    1. Find the nearest edge for each AIS point using a spatial index (gpd.sjoin_nearest with max_distance set to the search radius, so that points farther than that from any edge are dropped).
    2. For each point snapped to an edge, project it onto that edge using vectorized line_locate_point to get the distance along the edge.
    3. Divide each edge into segments of segment_length (e.g., 100 meters).
    4. Assign points to their corresponding edge segment, group by segment, and compute speed histograms.
    5. Return a GeoDataFrame containing the segmented fairway lines with the speed histograms attached.
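
Steps 2 to 4 can be illustrated for one edge with plain coordinate tuples. The sketch assumes the nearest-edge assignment of step 1 has already been done, uses a simple offset threshold in place of the search radius, and all function names are hypothetical.

```python
import math


def locate_on_line(coords, point):
    """Return (distance_along, offset) of the closest position on a polyline."""
    px, py = point
    best_offset, best_along, walked = math.inf, 0.0, 0.0
    for (x0, y0), (x1, y1) in zip(coords, coords[1:]):
        dx, dy = x1 - x0, y1 - y0
        length = math.hypot(dx, dy)
        if length == 0:
            continue  # skip degenerate pieces
        # Parameter of the orthogonal projection, clamped to the piece.
        t = ((px - x0) * dx + (py - y0) * dy) / (length * length)
        t = min(1.0, max(0.0, t))
        offset = math.hypot(px - (x0 + t * dx), py - (y0 + t * dy))
        if offset < best_offset:
            best_offset, best_along = offset, walked + t * length
        walked += length
    return best_along, best_offset


def bin_speeds_along_edge(coords, points, segment_length, max_offset):
    """points: (x, y, sog) tuples. Returns {segment_index: [sog, ...]}."""
    total = sum(
        math.hypot(x1 - x0, y1 - y0) for (x0, y0), (x1, y1) in zip(coords, coords[1:])
    )
    n_segments = max(1, math.ceil(total / segment_length))
    bins = {}
    for x, y, sog in points:
        along, offset = locate_on_line(coords, (x, y))
        if offset > max_offset:
            continue  # too far from the edge to be snapped
        # Clamp so a point at the very end falls in the last segment.
        index = min(int(along // segment_length), n_segments - 1)
        bins.setdefault(index, []).append(sog)
    return bins
```

Each list of speeds can then be turned into a histogram per segment. The last segment of an edge is shorter than segment_length whenever the edge length is not an exact multiple of it.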

C. Lock Chamber Visit Duration

  • Goal: Compute visit durations and duration histograms for lock chamber polygons.
  • Algorithm:
    1. Spatial join (within) to find all AIS points falling inside each lock chamber.
    2. Group points by lock chamber ID and vessel ID (pseudo_id / MMSI), sorted chronologically.
    3. Partition the points into distinct visits: if the time gap between consecutive points of a vessel in the same lock exceeds a threshold (e.g., 2 hours), split them into separate visits.
    4. For each visit, compute the duration: $\text{duration} = t_{\text{exit}} - t_{\text{entry}}$.
    5. Bin these durations (e.g. in minutes: [0, 15, 30, 45, 60, 90, 120, 180]) and append the count columns to the lock chamber GeoDataFrame.
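
Steps 3 to 5 can be sketched for the timestamps of one vessel inside one lock chamber. The 2 hour gap and the bin edges are the example values from the list above, the function names are hypothetical, and the spatial join and grouping are assumed to have happened already.

```python
from datetime import datetime, timedelta


def split_visits(timestamps, max_gap=timedelta(hours=2)):
    """Split one vessel's timestamps in one lock into (entry, exit) visits."""
    visits = []
    for t in sorted(timestamps):
        if visits and t - visits[-1][1] <= max_gap:
            visits[-1][1] = t  # still the same visit: move the exit time
        else:
            visits.append([t, t])  # gap exceeded (or first point): new visit
    return [(entry, exit_) for entry, exit_ in visits]


def duration_histogram(visits, edges_min=(0, 15, 30, 45, 60, 90, 120, 180)):
    """Count visit durations per half-open bin of minutes."""
    counts = [0] * (len(edges_min) - 1)
    for entry, exit_ in visits:
        minutes = (exit_ - entry).total_seconds() / 60
        for i in range(len(counts)):
            if edges_min[i] <= minutes < edges_min[i + 1]:
                counts[i] += 1
                break  # durations of 180 minutes or more are not counted here
    return counts
```

A vessel seen only once inside a chamber produces a visit of zero duration, which lands in the first bin. Whether such single-point visits should be kept is a decision this sketch leaves open.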

5. CLI commands

We will add a new CLI command group under ais-shader:

  • ais-shader analyze-passage: Compute passage line crossings.
  • ais-shader analyze-fairway: Project and bin speeds along fairway edges.
  • ais-shader analyze-lock: Compute visit durations inside lock chamber polygons.

All commands will export output files as GeoJSON/GeoPackage for direct visualization in GIS tools.
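
One possible shape for the command group is sketched below. The issue does not specify a CLI framework or any arguments, so argparse, the two positional arguments and the --output option are all assumptions made for illustration.

```python
import argparse

COMMANDS = {
    "analyze-passage": "Compute passage line crossings.",
    "analyze-fairway": "Project and bin speeds along fairway edges.",
    "analyze-lock": "Compute visit durations inside lock chamber polygons.",
}


def build_parser():
    parser = argparse.ArgumentParser(prog="ais-shader")
    subparsers = parser.add_subparsers(dest="command", required=True)
    for name, help_text in COMMANDS.items():
        command = subparsers.add_parser(name, help=help_text)
        # Hypothetical arguments, identical for all three commands.
        command.add_argument("ais", help="AIS GeoParquet in EPSG:3857")
        command.add_argument("geometries", help="Query geometries file")
        command.add_argument("--output", required=True, help="GeoJSON or GeoPackage path")
    return parser
```

A call such as build_parser().parse_args(["analyze-lock", "ais.parquet", "locks.geojson", "--output", "out.gpkg"]) then yields a namespace whose command attribute selects the analysis to run.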

6. Test Cases & Validation

We will implement the following tests in tests/test_analysis.py:

  1. test_passage_crossing: Create a straight trajectory intersecting a cross-section line. Verify that the crossing speed is interpolated correctly.
  2. test_fairway_snap: Snap random points to parallel line segments, and verify that projection distances along the lines are computed accurately.
  3. test_lock_duration: Create two separate visits of a vessel to a lock chamber separated by a large time gap. Verify they are split into two visits and durations are calculated correctly.
