Observations from R and Python benchmarks: performance bottlenecks and optimization ideas for sedona-db #2576
---

@paleolimbot @Kontinuation @zhangfengcdt @james-willis @Imbruced
---

> GeoDataFrame.from_arrow

@Kontinuation, do you think there is a way to combine the C serde you wrote for Sedona with the shapely conversions from a while ago? Do you think this even makes sense to do? I am not super familiar with Polars, but do you mean converting to Polars or to GeoPolars? I see that you used code similar to this:

```python
table = df.to_arrow_table()
polars_df = polars.from_arrow(table)
```

One thing that affected your benchmark is that, for SedonaDB to pandas, you created shapely objects from the WKB in Arrow, whereas for Polars you just took the raw binary and did nothing with it (a sketch of the difference follows below). I am wondering if you could load it into GeoPolars, run some operations on it, do the same with geopandas, and compare the times? @Robinlovelace, by any chance, do you have some benchmarks on the SedonaDB UDFs?
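A minimal sketch of that asymmetry, assuming `df` is a SedonaDB DataFrame whose WKB geometry column is named `geometry` (the column name is an assumption) and shapely >= 2.0 for the vectorized decoder:

```python
import polars
import shapely  # shapely >= 2.0 provides the vectorized from_wkb

# Assumes `df` is a SedonaDB DataFrame with a WKB column named "geometry".
table = df.to_arrow_table()

# Polars path: the WKB stays an opaque binary column; nothing is decoded.
polars_df = polars.from_arrow(table)

# pandas/geopandas path: every WKB value is decoded into a shapely object,
# which is the likely dominant cost when collecting to pandas.
geometries = shapely.from_wkb(table.column("geometry").to_pylist())
```

Timing the last two lines separately would isolate the geometry-decoding cost from the Arrow transfer itself.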
---
Very cool! Thank you for putting together these benchmarks! I think you're right about these bullets:

- Reading GeoPackages and converting outside the Arrow universe are always going to be slower than GeoParquet + staying inside the Arrow universe, and part of SedonaDB is strengthening those ecosystems to the point that those operations don't have to happen (i.e., we also want to make SedonaDB -> geopandas/sf and reading .gpkg files unnecessary most of the time by making sure we support the next step).
- If I'm reading these correctly, these benchmarks are of a …
- I'm not sure why sedonadb-polars isn't identical to sedonadb-sf for the spatial_join benchmark (I would have expected those results to be identical). I think that geopandas caches the spatial index, and I'm not sure you have a totally "fresh" GeoDataFrame for each iteration of your benchmark (see the sketch after this list). This might also be a case where Python/R string handling shines over Arrow, since there are a lot of repeated strings in the output.
- 16 polygons x 100k points is pretty small, and I'm pleased that SedonaDB doesn't add so much overhead that performance on that sort of microbenchmark becomes unreasonable.
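A short sketch of the caching behavior mentioned above (the file name is hypothetical): geopandas builds the spatial index lazily on first use and caches it, so rebuilding the frame per iteration is what guarantees a cold index:

```python
import geopandas

# Hypothetical input; any layer behaves the same way.
gdf = geopandas.read_file("points.gpkg")

print(gdf.has_sindex)  # False: no spatial index has been built yet
_ = gdf.sindex         # first access builds the index and caches it
print(gdf.has_sindex)  # True: later sjoin()/overlay() calls reuse it

# Re-reading the file yields a genuinely "fresh" frame with a cold index,
# which is probably what each benchmark iteration should start from.
fresh = geopandas.read_file("points.gpkg")
print(fresh.has_sindex)  # False again
```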
Continuing to kick the tires and write about it is fantastic! Knowing that there's interest in SedonaDB for R is helpful (it's a bit of a side project relative to the other SedonaDB work I do, and it's motivating to know that somebody actually plans on using it 🙂). I'm not sure I will ever get to writing a GeocompX variant, but in theory that's what we're trying to provide with SedonaDB, and it's a great blueprint for what SedonaDB should be able to do at some point.
---
I have tested the sedona-db R and Python interfaces against established tools like geopandas and sf (R). You can find the full reproducible setup and current results here: Robinlovelace/geobench.

The benchmarks show impressive performance for Sedona. I hope the results are of interest, and I think some of the observations below could point the way to speed-ups in the interfaces. I'm quite new to the project, so I'm sharing these observations as a discussion rather than bombarding the project with issues; I've already opened one, as you'll see in the link below!
- In my Python benchmarks, I noticed a significant drop in performance when collecting query results using .to_pandas(); converting the Arrow WKB output into shapely objects appears to be the dominant cost.
- The R interface currently lacks an equivalent to the Python sd.read_pyogrio(), so data has to be read with sf and then converted back to Arrow for Sedona, which adds overhead (a sketch of this round trip follows below).
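A minimal sketch of that round trip expressed with the Python analogues (geopandas >= 1.0 for to_arrow(); the file name is hypothetical):

```python
import geopandas

# Read via GDAL/pyogrio into an in-memory GeoDataFrame (the sf analogue)...
gdf = geopandas.read_file("buildings.gpkg")

# ...then re-serialize to Arrow so Sedona can consume it. This extra hop is
# the overhead that a native reader such as Python's sd.read_pyogrio()
# avoids by producing Arrow data directly.
arrow_table = gdf.to_arrow()
```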
I am curious about the roadmap for parallelized linestring operations in the native engine.
I’d love to hear the community’s thoughts on these observations and how I can best contribute to testing these high-performance paths!