Hello,
I am working on building a geospatial platform, and one of our use cases is to generate the K nearest neighbours (k=1) between two datasets.
I came across Mosaic’s SpatialKNN implementation and tested it.
On the landmark side, I have 90 million rows (100% multipolygons), and on the candidate side, there are 5 million rows (80% linestrings and 20% multipolygons).
The index resolution is set to 10, and I’m running the code on Databricks DBR 13.3 LTS.
Cluster spec: 64 cores / 256 GB memory (Delta cache accelerated).
I managed to get accurate results when I limited the landmark dataset to 100 rows and included all data from the candidate dataset.
However, when I used the full landmark dataset, the job couldn’t finish, and I had to cancel it.
I suspect that the slowness is due to handling the multipolygons:
- The grid_tessellateexplode of multipolygons takes a significant amount of time. On the candidate dataset with only the linestrings it took 4 minutes, while with the multipolygon rows added to the linestrings it took 30 minutes.
- The job hangs on the grid_geometrykringexplode step of the landmark dataset, during the first iteration.
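To give a rough sense of why the k-ring step on the landmark side might be so heavy, here is a back-of-the-envelope sketch. It assumes a hexagonal grid such as H3 (the index system is an assumption here), where a k-ring around one cell contains 1 + 3k(k+1) cells, ignoring pentagons. The `chips_per_geometry` value is a hypothetical placeholder, not something I measured, and the result is only an upper bound because overlapping rings around neighbouring cells of the same geometry would be de-duplicated.

```python
def kring_cell_count(k: int) -> int:
    """Cells in a hexagonal k-ring around one cell (centre included), ignoring pentagons."""
    return 1 + 3 * k * (k + 1)


def exploded_rows(n_geometries: int, chips_per_geometry: int, k: int) -> int:
    """Upper bound on rows produced by exploding a k-ring around every chip.

    Overlapping rings around neighbouring chips of one geometry are not
    de-duplicated here, so the real row count would be lower.
    """
    return n_geometries * chips_per_geometry * kring_cell_count(k)


if __name__ == "__main__":
    landmarks = 90_000_000   # landmark row count from this issue
    chips_per_geometry = 20  # hypothetical average, not measured
    for k in (1, 2, 3):
        rows = exploded_rows(landmarks, chips_per_geometry, k)
        print(f"k={k}: {kring_cell_count(k)} cells per ring, up to {rows:,} rows")
```

Even with a modest number of chips per multipolygon, the exploded landmark side could reach billions of rows at the first iteration under these assumptions.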
On the Spark UI, I don’t see any issues that could guide me toward a solution.
Reducing the number of iterations doesn't help because the job couldn't even manage to finish the first iteration.
Also, reducing the index resolution to 8 decreased the candidate dataset preparation time by 5 minutes, but had no impact on the landmark side.
Do you have any recommendations regarding cluster sizing or other optimizations that could improve the performance?