Handle border cases where labels absent in presence-absence datasets #29

@hannah-rae

Many datasets contain multiple clusters/grids of labeled fields. Inside each cluster/grid the labels are presence/absence (fields are fully labeled), but on its border there may still be fields that are not labeled. If a chip is placed along such a border, the mask for that chip can be only partially labeled.

The --drop-border-chips flag addresses this only partially: it removes chips on the border of the convex hull computed over all of the fields, not per cluster/grid. The borders between clusters lie inside the dataset-level convex hull, so the chips on those interior borders don't get dropped.

See, for example, Estonia, which is representative of the European countries in FTW where two grids were sampled:
Image

This is even worse for regions that have many clusters, like Cambodia:
Image

We need a better solution for --drop-border-chips that accounts for the interior boundaries too. This is hard because nothing in the parquet files indicates which cluster/grid a field belongs to, and we don't always know how many clusters/grids there are.

One solution might be to use DBSCAN on the tile IDs or lat/lons, but we don't want it to be too slow.
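
To make the clustering idea concrete, here is a rough sketch of a cheap stand-in for DBSCAN. It is not DBSCAN itself: it bins points into `eps`-sized cells and flood-fills over neighbouring occupied cells, which only approximates DBSCAN with `min_samples=1` but runs in roughly linear time and needs no extra dependencies. The function name `cluster_points`, the `eps` parameter, and the assumption that each field is reduced to one `(x, y)` point (for example a centroid in lon/lat) are all hypothetical and not part of the existing code.

```python
from collections import deque


def cluster_points(points, eps):
    """Assign a cluster label to each (x, y) point.

    Hypothetical helper: points are binned into eps-sized grid cells, and
    occupied cells that touch (8-connectivity) are merged into one cluster.
    """
    # Map each occupied grid cell to the indices of the points inside it.
    cells = {}
    for i, (x, y) in enumerate(points):
        key = (int(x // eps), int(y // eps))
        cells.setdefault(key, []).append(i)

    labels = [-1] * len(points)
    seen = set()
    next_label = 0

    for start in cells:
        if start in seen:
            continue
        # Breadth-first flood fill over neighbouring occupied cells.
        seen.add(start)
        queue = deque([start])
        while queue:
            cx, cy = queue.popleft()
            for i in cells[(cx, cy)]:
                labels[i] = next_label
            for dx in (-1, 0, 1):
                for dy in (-1, 0, 1):
                    neighbour = (cx + dx, cy + dy)
                    if neighbour in cells and neighbour not in seen:
                        seen.add(neighbour)
                        queue.append(neighbour)
        next_label += 1

    return labels


# Two well-separated groups of made-up points.
points = [(0.0, 0.0), (0.5, 0.2), (1.1, 0.9), (10.0, 10.0), (10.4, 10.3)]
labels = cluster_points(points, eps=1.0)
```

With a label per field, the border could then be computed per cluster (for instance one convex hull per label) rather than once over the whole dataset, so the interior borders between clusters would be treated as borders too. Choosing `eps` is the open question: it would need to be larger than the spacing between fields within a cluster but smaller than the gap between clusters.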
