
Question: How to improve KNN performance on big datasets? #605

@CommanderWahid

Description


Hello,

I am working on building a geospatial platform, and one of our use cases is to generate the K nearest neighbours (k=1) between two datasets.

I came across Mosaic's SpatialKNN implementation and I've tested it.

On the landmark side, I have 90 million rows (100% multipolygons), and on the candidate side, there are 5 million rows (80% linestrings and 20% multipolygons).
The index resolution is set to 10, and I’m running the code on Databricks DBR 13.3 LTS.
Cluster spec: 64 cores / 256 GB memory (Delta cache acceleration enabled).

I managed to get accurate results when I limited the landmark dataset to 100 rows and included all data from the candidate dataset.

However, when I used the full landmark dataset, the job couldn’t finish, and I had to cancel it.

I suspect that the slowness is due to handling the multipolygons:

  • The grid_tessellateexplode of multipolygons takes a significant amount of time. On the candidate dataset with just the linestrings it took 4 minutes, while adding the multipolygon rows to the linestrings pushed it to 30 minutes.

  • The job hangs, on the first iteration, at the grid_geometrykringexplode step of the landmark dataset.
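For intuition on why the multipolygon rows dominate the tessellation cost, here is a toy back-of-the-envelope model in plain Python. It is not Mosaic's actual indexing: the halving-edge grid and the 0.1-degree extent are made-up assumptions purely for illustration. The point is that an areal geometry touches a number of cells that grows quadratically in 1/cell-edge, while a linestring grows only linearly, so each resolution step roughly quadruples the chips produced for polygons but only doubles them for lines.

```python
# Toy illustration (NOT Mosaic's real grid index): rough count of grid
# cells a geometry touches at a given resolution. In this made-up grid
# the cell edge halves at every resolution step.

def cells_touched(kind, size, resolution):
    """Rough cell count for a geometry of extent `size` (in degrees)."""
    cell_edge = 1.0 / (2 ** resolution)      # hypothetical: edge halves per level
    per_side = max(1, int(size / cell_edge)) # cells spanned along one side
    if kind == "polygon":
        return per_side ** 2                 # fills an area: quadratic growth
    return per_side                          # follows a line: linear growth

line_r10 = cells_touched("linestring", 0.1, 10)  # → 102
poly_r10 = cells_touched("polygon", 0.1, 10)     # → 10404
poly_r8  = cells_touched("polygon", 0.1, 8)      # → 625
```

Under these toy assumptions, a polygon of the same extent as a line produces roughly 100x the chips at resolution 10, and dropping from resolution 10 to 8 shrinks the polygon's chip count by ~16x, which is consistent with tessellation cost being dominated by the multipolygon rows.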

On the Spark UI, I don’t see any issues that could guide me toward a solution.

Reducing the number of iterations doesn't help, because the job can't even finish the first iteration.
Also, reducing the index resolution to 8 cut the candidate dataset preparation time by 5 minutes, but had no impact on the landmark side.
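For reference, the per-iteration growth that the k-ring expansion has to materialise can be sketched with the standard hexagonal k-ring formulas (a toy model in plain Python; the 500-cell multipolygon is a made-up figure, not taken from my data): ring k of a hexagonal grid holds 6k cells and the full k-disk holds 1 + 3k(k+1), and that disk gets expanded around every index cell of every landmark geometry at each iteration.

```python
# Toy sketch of the k-ring growth behind each SpatialKNN iteration on a
# hexagonal grid (same combinatorics as H3's k-ring, but plain Python).

def kring_size(k):
    """Cells in hexagonal ring k (ring 0 is the cell itself)."""
    return 1 if k == 0 else 6 * k

def kdisk_size(k):
    """Total cells within k rings of the origin cell."""
    return 1 + 3 * k * (k + 1)

# A (hypothetical) multipolygon tessellated into 500 index cells probes
# roughly 500 * kdisk_size(k) candidate cells at iteration k.
probes_iter3 = 500 * kdisk_size(3)  # → 500 * 37 = 18500
```

This is why large multipolygons hurt twice: they tessellate into many index cells, and each of those cells then fans out into its own k-disk of neighbour cells on every iteration.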

Do you have any recommendations regarding cluster sizing or other optimizations that could improve performance?
