This is a little break from the zarr stuff (definitely getting back to it soon, though), and something I've been wanting to look into for a while. First and foremost, this is a learning project, with a secondary (ambitious and optional) goal of writing a faster spatial join operation. I'm following what SedonaDB does to understand the general idea, but the core logic will be very different.
For this first attempt, I will only do a simple ST_Within join, without the more advanced features like writing to disk to avoid running out of memory. I'm also just relying on a plain binary array for geometries. For now I'm not even going to bother labeling it as geo data; all of that can come later if there's any value here.
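To make "a binary array for geometries" a bit more concrete, here is a minimal sketch. It assumes the binary values are WKB (well-known binary), which the post does not actually state, and it only handles 2D points. The helper names `encode_wkb_point` and `decode_wkb_point` are made up for illustration, and a plain Python list stands in for the real binary array.

```python
import struct

WKB_POINT = 1  # geometry type code for Point in WKB


def encode_wkb_point(x, y):
    """Encode a 2D point as little-endian WKB (hypothetical helper)."""
    # 1 byte byte-order flag (1 = little endian), uint32 type, two float64 coords
    return struct.pack("<BIdd", 1, WKB_POINT, x, y)


def decode_wkb_point(buf):
    """Decode a 2D WKB point back into an (x, y) tuple (hypothetical helper)."""
    endian = "<" if buf[0] == 1 else ">"
    geom_type, x, y = struct.unpack_from(endian + "Idd", buf, 1)
    if geom_type != WKB_POINT:
        raise ValueError(f"expected a WKB point, got type {geom_type}")
    return x, y


# Assumption: the geometry column is just a sequence of opaque byte strings,
# with no metadata marking it as geo data.
column = [encode_wkb_point(1.0, 2.0), encode_wkb_point(-3.5, 4.25)]
points = [decode_wkb_point(value) for value in column]
```

Nothing in the column itself says "this is geometry", which matches the idea of leaving the geo labeling for later.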
The approach I want to take is to run every step of the spatial join "in bulk". That means tree traversal (for the spatial index) happens for all the geometries in a probe-side batch in one call, and refinement explodes all the geometries together into components and runs the various checks in a few calls (not a single call, since different geometry types need different logic). If this works, it could be faster, at least for large datasets, than traversing the spatial index and doing refinement one row at a time.
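As a rough illustration of the bulk idea, here is a toy sketch in pure Python. It is not the actual implementation: a brute-force bounding-box check stands in for the bulk tree traversal, the probe side only contains points and multipoints, the build side only contains simple polygons without holes, and boundary cases are ignored. All function names are hypothetical.

```python
def bbox(coords):
    """Bounding box (minx, miny, maxx, maxy) of a list of (x, y) tuples."""
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    return (min(xs), min(ys), max(xs), max(ys))


def bulk_candidates(probe_boxes, build_boxes):
    """Filter step for a whole probe batch in one call.

    Stand-in for a bulk spatial-index traversal: keeps every (probe, build)
    pair where the probe box lies inside the build box.
    """
    pairs = []
    for pi, (pminx, pminy, pmaxx, pmaxy) in enumerate(probe_boxes):
        for bi, (bminx, bminy, bmaxx, bmaxy) in enumerate(build_boxes):
            if bminx <= pminx and bminy <= pminy and pmaxx <= bmaxx and pmaxy <= bmaxy:
                pairs.append((pi, bi))
    return pairs


def point_in_ring(x, y, ring):
    """Even-odd ray casting. Points exactly on the boundary get no special care."""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def bulk_within(probe, build):
    """Return (probe_idx, build_idx) pairs where the probe geometry is within the polygon.

    probe: list of point lists (a point is a 1-element list, a multipoint is longer)
    build: list of polygon rings (lists of (x, y) tuples, no holes)
    """
    # Step 1: candidate pairs for the whole batch at once.
    pairs = bulk_candidates([bbox(g) for g in probe], [bbox(r) for r in build])
    # Step 2: explode every candidate into one row per component point.
    exploded = [(k, x, y) for k, (pi, _) in enumerate(pairs) for (x, y) in probe[pi]]
    # Step 3: run the point-in-polygon check over all exploded rows in one pass.
    hits = [(k, point_in_ring(x, y, build[pairs[k][1]])) for k, x, y in exploded]
    # Step 4: reduce back to one verdict per candidate pair.
    ok = [True] * len(pairs)
    for k, hit in hits:
        ok[k] = ok[k] and hit
    return [pair for pair, keep in zip(pairs, ok) if keep]


square = [(0, 0), (4, 0), (4, 4), (0, 4)]
triangle = [(10, 0), (14, 0), (10, 4)]
probe = [[(1, 1)], [(1, 1), (3, 3)], [(13, 3)], [(11, 1)], [(1, 1), (11, 1)]]
matches = bulk_within(probe, [square, triangle])
```

In this example, probe geometry 2 passes the bounding-box filter for the triangle but is rejected during refinement, and probe geometry 4 never becomes a candidate. Other geometry types (linestrings, polygons) would need their own refinement pass, which is why the bulk version ends up as a few calls rather than one.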
I've already made some good progress on this, and I'm hoping to have something I can benchmark in several days, maybe a week or two; we'll see. There are definitely a lot of challenges, but it's fun to see how spatial joins work, so it's a good learning experience even if nothing comes of it.