Many datasets contain multiple clusters/grids of field labels. Inside each cluster/grid the labels are presence/absence, i.e. every field is labeled. Along the borders of a cluster/grid, however, there may still be unlabeled fields, so a chip placed on such a border can end up with a partially labeled mask.
The --drop-border-chips flag addresses this only partially: it removes chips on the border of the convex hull computed over all of the fields, not per cluster/grid. The borders between clusters lie inside the dataset-level convex hull, so the chips on those interior borders don't get dropped.
See for example Estonia, which is representative of the European countries in FTW where two grids were sampled:

This is even worse for regions that have many clusters like Cambodia:

We need a better solution for --drop-border-chips that accounts for the interior boundaries too. This is hard because nothing in the parquet files indicates which cluster/grid a field belongs to, and we don't always know how many clusters/grids there are.
One solution might be to use DBSCAN on the tile IDs or lat/lons, but we don't want it to be too slow.
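If DBSCAN turns out to be too slow, a cheaper stand-in in the same spirit is to bin field centroids into a coarse grid and take connected components of the occupied cells. The sketch below is not DBSCAN and is not existing FTW code: `cluster_points` and `cell_size` are hypothetical names, and it assumes centroids are available as (x, y) pairs and that the gap between clusters/grids is wider than `cell_size`.

```python
from collections import defaultdict


def cluster_points(points, cell_size):
    """Label (x, y) points by connected component of occupied grid cells.

    Hypothetical helper: two points share a label if their cells are
    linked through a chain of 8-connected occupied cells.
    """
    # Bin every point into a square cell of side `cell_size`.
    cells = defaultdict(list)
    for idx, (x, y) in enumerate(points):
        cells[(int(x // cell_size), int(y // cell_size))].append(idx)

    labels = [-1] * len(points)
    seen = set()
    next_label = 0
    for start in cells:
        if start in seen:
            continue
        # Flood fill across neighbouring occupied cells.
        stack = [start]
        seen.add(start)
        while stack:
            cx, cy = stack.pop()
            for idx in cells[(cx, cy)]:
                labels[idx] = next_label
            for dx in (-1, 0, 1):
                for dy in (-1, 0, 1):
                    nb = (cx + dx, cy + dy)
                    if nb in cells and nb not in seen:
                        seen.add(nb)
                        stack.append(nb)
        next_label += 1
    return labels


# Two synthetic 3x3 grids of centroids separated by a wide gap.
grid_a = [(x * 0.01, y * 0.01) for x in range(3) for y in range(3)]
grid_b = [(1.0 + x * 0.01, y * 0.01) for x in range(3) for y in range(3)]
labels = cluster_points(grid_a + grid_b, cell_size=0.05)
```

This runs in time linear in the number of points and does not need the number of clusters up front. The resulting labels could then be used to compute one convex hull per cluster/grid instead of a single dataset-level hull, so that chips on interior borders are dropped as well.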