Observations from R and Python benchmarks: performance bottlenecks and optimization ideas for sedona-db #2576
---

@paleolimbot @Kontinuation @zhangfengcdt @james-willis @Imbruced
---

> GeoDataFrame.from_arrow

@Kontinuation, do you think there is a way to combine the C serde you wrote for Sedona with the shapely conversions from a while ago? Do you think this even makes sense to do? I am not super familiar with Polars, but do you mean converting to Polars or to GeoPolars? I see that you used code similar to this:

```python
table = df.to_arrow_table()
polars_df = polars.from_arrow(table)
```

One thing that affected your benchmark is that, for SedonaDB to pandas, you created shapely objects from the WKB in Arrow, whereas for Polars you just took the raw binary and did nothing with it (a sketch of the difference follows below). I am wondering if you could load it into GeoPolars, run some operations on it, do the same with geopandas, and compare the times? @Robinlovelace, by any chance, do you have some benchmarks on the SedonaDB UDFs?
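A minimal sketch of that asymmetry, assuming `df` is a SedonaDB DataFrame whose WKB geometry column is named `geometry` (the column name is an assumption) and shapely >= 2.0 for the vectorized decoder:

```python
import polars
import shapely  # shapely >= 2.0 provides the vectorized from_wkb

# Assumes `df` is a SedonaDB DataFrame with a WKB column named "geometry".
table = df.to_arrow_table()

# Polars path: the WKB stays an opaque binary column; nothing is decoded.
polars_df = polars.from_arrow(table)

# pandas/geopandas path: every WKB value is decoded into a shapely object,
# which is the likely dominant cost when collecting to pandas.
geometries = shapely.from_wkb(table.column("geometry").to_pylist())
```

Timing the last two lines separately would isolate the geometry-decoding cost from the Arrow transfer itself.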
---
Very cool! Thank you for putting together these benchmarks! I think you're right about these bullets:

- Reading GeoPackages and converting outside the Arrow universe are always going to be slower than GeoParquet + staying inside the Arrow universe, and part of SedonaDB is strengthening those ecosystems to the point that those operations don't have to happen (i.e., we also want to make SedonaDB -> geopandas/sf and reading .gpkg files unnecessary most of the time by making sure we support the next step).
- If I'm reading these correctly, these benchmarks are of a …
- I'm not sure why sedonadb-polars isn't identical to sedonadb-sf for the spatial_join benchmark (I would have expected those results to be identical). I think that geopandas caches the spatial index, and I'm not sure you have a totally "fresh" GeoDataFrame for each iteration of your benchmark (see the sketch after this list). This might also be a case where Python/R string handling shines over Arrow, since there are a lot of repeated strings in the output.
- 16 polygons x 100k points is pretty small, and I'm pleased that SedonaDB doesn't add so much overhead that performance on that sort of microbenchmark becomes unreasonable.
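A short sketch of the caching behavior mentioned above (the file name is hypothetical): geopandas builds the spatial index lazily on first use and caches it, so rebuilding the frame per iteration is what guarantees a cold index:

```python
import geopandas

# Hypothetical input; any layer behaves the same way.
gdf = geopandas.read_file("points.gpkg")

print(gdf.has_sindex)  # False: no spatial index has been built yet
_ = gdf.sindex         # first access builds the index and caches it
print(gdf.has_sindex)  # True: later sjoin()/overlay() calls reuse it

# Re-reading the file yields a genuinely "fresh" frame with a cold index,
# which is probably what each benchmark iteration should start from.
fresh = geopandas.read_file("points.gpkg")
print(fresh.has_sindex)  # False again
```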
Continuing to kick the tires and write about it is fantastic! Knowing that there's interest in SedonaDB for R is helpful (it's a bit of a side project relative to the other SedonaDB work I do, and it's motivating to know that somebody actually plans on using it 🙂). I'm not sure I will ever get to writing a GeocompX variant, but in theory that's what we're trying to provide with SedonaDB, and it's a great blueprint for what SedonaDB should be able to do at some point.
---
I have tested the sedona-db R and Python interfaces against established tools like geopandas and sf (R). You can find the full reproducible setup and current results here: Robinlovelace/geobench.

The benchmarks show impressive performance for Sedona. I hope the results are of interest, and I think some of the observations below could point the way to speed-ups in the interfaces. I'm quite new to the project, so I'm sharing these observations as a discussion rather than bombarding the project with issues; I've already opened one, as you'll see in the link below!
- In my Python benchmarks, I noticed a significant drop in performance when collecting query results using .to_pandas(); converting the Arrow WKB output into shapely objects appears to be the dominant cost.
- The R interface currently lacks an equivalent to the Python sd.read_pyogrio(), so data has to be read with sf and then converted back to Arrow for Sedona, which adds overhead (a sketch of this round trip follows below).
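A minimal sketch of that round trip expressed with the Python analogues (geopandas >= 1.0 for to_arrow(); the file name is hypothetical):

```python
import geopandas

# Read via GDAL/pyogrio into an in-memory GeoDataFrame (the sf analogue)...
gdf = geopandas.read_file("buildings.gpkg")

# ...then re-serialize to Arrow so Sedona can consume it. This extra hop is
# the overhead that a native reader such as Python's sd.read_pyogrio()
# avoids by producing Arrow data directly.
arrow_table = gdf.to_arrow()
```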
I am curious about the roadmap for parallelized linestring operations in the native engine.
I’d love to hear the community’s thoughts on these observations and how I can best contribute to testing these high-performance paths!