GCS Benchmarks by xaviernogueira · Pull Request #70 · datafusion-contrib/arrow-zarr

xaviernogueira · 2026-01-21T07:16:48Z

Contribution Context:
Super cool project you have here! I was privately working on my own version of arrow-zarr but without datafusion (parquet focused) and with a lot less progress than you. Anyways I gave up on that and am looking to help build things here and be involved in this project, especially around raster<>vector UDFs over time.

That said this is my first PR in rust, I am a bit of a beginner, so some of the functionality I'd like to get into is above my pay-grade at the moment. To help my learning I decided to go for some low hanging fruit here.

Changes:

Decomposed s3_bench.rs to create a CloudStorageBenchBackend trait in shared.rs -> reduces duplication as additional benchmark backends are added.
Implemented a GCS version of the S3 benchmark using said interface.

Results (around 2x slower than s3?):

gcs_benchmarks/join_benchmark
  time:   [16.424 s 17.994 s 19.681 s]
  change: [+96.419% +119.81% +145.68%] (p = 0.00 < 0.05)
  Performance has regressed.

gcs_benchmarks/union_benchmark
  time:   [11.741 s 11.878 s 12.022 s]
  change: [+88.345% +101.73% +117.00%] (p = 0.00 < 0.05)
  Performance has regressed.

xaviernogueira · 2026-01-22T19:53:14Z

Requesting review from contributors: @maximedion2 @kylebarron @tshauck @alamb

maximedion2 · 2026-01-22T20:06:21Z

Hey there! Thanks for the contribution! A few notes:

I'm the only one really making contributions right now (although there's been some good discussions with @alxmrs who's also working on something similar), Andrew Lamb just created the repo initially since it's a datafusion contrib (tbh I might move this to somewhere else, like my own repo, I'm not sure it should be a datafusion contrib anymore, plus I want to change the name haha), and Kyle and Trent have previously made contributions, but I'll be the one reviewing the PR.
We're going to have to wait a little bit to merge this, I have a PR with python bindings that I will be merging soon, unfortunately it modifies the repo structure to create separate crates. Shouldn't be too bad, we can discuss how to update this PR after I merge mine.
Re: the numbers you are seeing, I remember the Arraylake guys saying that there was some specialized S3 implementation they did for Icechunk, and that it was faster than out-of-the-box S3 stuff, was you bench for an Icechunk repo?

In any case, thanks for helping out, I have a lot of things I want to do with this project but not nearly as much time as I would like, happy to get some help on this!

xaviernogueira · 2026-01-22T20:32:36Z

@maximedion2 Thanks for the quick response! I totally understand, ping me when your PR merges and I can go deal with conflicts here. This was half a rust learning exercise for me so merge or not I enjoyed the experience.

Regarding the numbers I actually don't really know what it is relative to (hence the "?") as I only ran the GCS bench (don't have a AWS bucket set up but can do that a bit later) and I only ran it once...kinda odd!

I'll be much more free to contribute in the spring as I'll be becoming a freelancer / leaving my full time due to relocation out of USA in April. Looking forward to properly diving in there, especially around raster<>vector stuff + additional dimensional.

If you want, also feel free to make Issues that you don't have time for, I could use those as oppurtunities to jump in, as I know if can be a touch disruptive to receive un-requested PRs.

alamb · 2026-01-22T20:53:45Z

Thanks for the request, but sadly I am not likely to have time to review PRs in this repo

…

On Thu, Jan 22, 2026 at 2:53 PM Xavier Nogueira ***@***.***> wrote: *xaviernogueira* left a comment (datafusion-contrib/arrow-zarr#70) <#70 (comment)> Requesting review from contributors: @maximedion2 <https://github.com/maximedion2> @kylebarron <https://github.com/kylebarron> @tshauck <https://github.com/tshauck> @alamb <https://github.com/alamb> — Reply to this email directly, view it on GitHub <#70 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AADXZMIPCDW6W4HWIZGPGC34IETEDAVCNFSM6AAAAACSLXO5ASVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZTOOBWGM4DKNJYGY> . You are receiving this because you were mentioned.Message ID: ***@***.***>

xaviernogueira · 2026-01-22T21:02:19Z

@alamb ha automated review request rejection that's awesome, didn't actually realize it was you from influxDB wouldn't have requested if I did I assume you are quite busy.

alamb · 2026-01-22T21:28:51Z

Well it wans't quite automated -- I responded to an email github alert :)

maximedion2 · 2026-01-23T02:10:19Z

@maximedion2 Thanks for the quick response! I totally understand, ping me when your PR merges and I can go deal with conflicts here. This was half a rust learning exercise for me so merge or not I enjoyed the experience.

Regarding the numbers I actually don't really know what it is relative to (hence the "?") as I only ran the GCS bench (don't have a AWS bucket set up but can do that a bit later) and I only ran it once...kinda odd!

I'll be much more free to contribute in the spring as I'll be becoming a freelancer / leaving my full time due to relocation out of USA in April. Looking forward to properly diving in there, especially around raster<>vector stuff + additional dimensional.

If you want, also feel free to make Issues that you don't have time for, I could use those as oppurtunities to jump in, as I know if can be a touch disruptive to receive un-requested PRs.

Okay I just merged my PR, I'll take a look at yours probably tomorrow.

Ah, right, I don't have a GCS account, I'm more familiar with AWS so I set that up initially (and it's kind of why I was procrastinating about implementing GCS support haha) so it's good that you have it the other way around. I'd be curious about a side by side comparison between AWS and GCS though.

Sounds good, I do have a couple good issues to work on, that are not necessarily urgent, I'll create those and you can pick them up if/when you have time (and of course feel free to ask any questions you'd like, happy to discuss).

maximedion2 · 2026-01-23T02:31:09Z

oh and @xaviernogueira I'm not completely sure what you mean by "raster <> vector stuff", but maybe related to that, part of my plan is to implement spatial functions specialized for points (e.g. a ST_Within where one side is always a point geo and the other is just any type of geo). SedonaDB already implements the traditional spatial stuff in datafusion I think, but I'd like to specialize this crate on raster data from zarr/icechunk, I have a few ideas regarding spatial operations for points, which may or may not be more fun than useful, only way to know is to try!

xaviernogueira · 2026-01-23T05:03:01Z

@maximedion2 sweet, good stuff with the Python bindings. I should have some time this weekend to loop back around and get the merge conflicts resolved and address your review (if you do it by then...no rush!).

And yes I really only use GCP although I may change that for my own stuff going forward as its kinda pricey.

As by "raster<>vector stuff" yes I am referring to functionality currently in SedonaDB like ZonalStats, so plugging into that via a Raster consutrctor (probably closest to RS_fromNetCDF. That said I am also refering to functionality that is missing from SedonaDB (that i am familiar with via ArcGIS), but working on those is more of a longer term goal, and would likely be SedonaDB contributions anyways so out of scope of this convo. My thoughts here are a bit half baked and evolving tbh, and should come more together in the Spring when I can more fully focus on OSS vs the stuff I am doing at my job currently.

maximedion2 · 2026-01-23T15:02:20Z

@maximedion2 sweet, good stuff with the Python bindings. I should have some time this weekend to loop back around and get the merge conflicts resolved and address your review (if you do it by then...no rush!).

And yes I really only use GCP although I may change that for my own stuff going forward as its kinda pricey.

As by "raster<>vector stuff" yes I am referring to functionality currently in SedonaDB like ZonalStats, so plugging into that via a Raster consutrctor (probably closest to RS_fromNetCDF. That said I am also refering to functionality that is missing from SedonaDB (that i am familiar with via ArcGIS), but working on those is more of a longer term goal, and would likely be SedonaDB contributions anyways so out of scope of this convo. My thoughts here are a bit half baked and evolving tbh, and should come more together in the Spring when I can more fully focus on OSS vs the stuff I am doing at my job currently.

Ah I see. Yeah I was looking at SedonaDB recently, a lot of the stuff is pub so components can be re-used. I think rasters are a WIP, but once I see the internals of how it works in SedonaDB we could certainly add a "RS_fromZarr" or something like that. And it seems things are very modular in datafusion, so we could potentially spin up a SessionContext, register the RS_fromZarr, register whatever UDFs you need from SedonaDB (e.g. ZonalStats), and combine all that so that it's all available in the same query. And then of course we can add whatever custom functionality (ExecutionPlan, UDFs) on top of that.

maximedion2 · 2026-01-26T00:22:54Z

+// ============================================================================
+
+struct GCSBenchBackend {
+    _bucket: String,


nit: everything is private unless you say it's public in rust, no need for the underscores for class members.

maximedion2 · 2026-01-26T00:25:00Z

+    }
+
+    async fn cleanup(&self) {
+        // Cleanup is handled by the TestFixture Drop implementation


Not sure I follow... the test fixture calls this method when it's dropped, but you haven't implemented anything here (but you did for the s3 version)?
Generally speaking though, why do we need this, that gets called in the test fixture drop? why not simply implement drop on the gcs and s3 backends directly?

maximedion2 · 2026-01-26T00:33:09Z

@@ -176,6 +176,26 @@ impl ZarrTableUrl {
                        .await
                        .map_err(|e| DataFusionError::External(Box::new(e)))?
                    }
+                    "gs" => {


can you also implement the zarr case? it's right above it in the code, it relies on ObjectStore (which has a gcs backend).

also, I was gonna ask why you have a use statement here, but I just saw that that's what I did too apparently... generally speaking I think that's probably not a good pattern, that was just me being sloppy.

but, this has made me realize that I need to properly make s3 and gcs optional, set up the right features, etc... So let's leave it as is for now, I will pick this up after we merge the PR.

maximedion2 · 2026-02-04T02:33:09Z

@xaviernogueira were you planning on picking this up soon? No rush, just want to know when I can start working on making S3/GCS proper features in the crate.

xaviernogueira · 2026-02-04T08:33:22Z

@maximedion2 busy with a sprint at my job right now and getting setup in a new country! If you want to pick it up feel free, I'll definitely circle back at some point soonish

maximedion2 · 2026-02-05T02:30:12Z

@maximedion2 busy with a sprint at my job right now and getting setup in a new country! If you want to pick it up feel free, I'll definitely circle back at some point soonish

Ah I see, no problem I have a few other things that I want to work on that are not blocked by this, I'll get started on it and I'll circle back to this PR in a little while, if you're still busy I'll pick it up, but I'll give you more time to take a look if you want.

jiayuasu · 2026-02-18T18:56:10Z

Thanks for the great discussion.

The raster capabilities in SedonaDB are indeed under active development, but they primarily reimplement the functionality already available in SedonaSpark: https://sedona.apache.org/latest/tutorial/raster/ SedonaSpark raster functions have been used by many users in production.

Both SedonaDB and SedonaSpark adopt the PostGIS Raster data model, which allows us to remain compatible with PostGIS.

maximedion2 · 2026-02-19T04:09:27Z

@maximedion2 busy with a sprint at my job right now and getting setup in a new country! If you want to pick it up feel free, I'll definitely circle back at some point soonish

Ah I see, no problem I have a few other things that I want to work on that are not blocked by this, I'll get started on it and I'll circle back to this PR in a little while, if you're still busy I'll pick it up, but I'll give you more time to take a look if you want.

Thanks for the great discussion.

The raster capabilities in SedonaDB are indeed under active development, but they primarily reimplement the functionality already available in SedonaSpark: https://sedona.apache.org/latest/tutorial/raster/ SedonaSpark raster functions have been used by many users in production.

Both SedonaDB and SedonaSpark adopt the PostGIS Raster data model, which allows us to remain compatible with PostGIS.

Thanks for the note! I'll check in with @xaviernogueira, I have several things I want to work on with this project, some ideas I want to try, etc... and I think he was interested in the "raster from zarr" concept, so when he's more available maybe that could be a really cool project to work on. I'll check the docs/tutorial you mentioned in the meantime, I'm not super familiar with rasters so I can definitely use a primer!

xaviernogueira added 4 commits January 21, 2026 00:47

✏️ Decomposing S3 bench code

d104e23

✨ adding GCS bench + dev dependency

d13f698

✏️ fixing things

7ec113b

✏️ getting things to work with google-cloud-storage v1.6

cf31198

xaviernogueira marked this pull request as ready for review January 22, 2026 19:52

maximedion2 self-requested a review January 22, 2026 20:03

maximedion2 reviewed Jan 26, 2026

View reviewed changes

maximedion2 linked an issue Feb 4, 2026 that may be closed by this pull request

Add support for GCS #71

Open

xaviernogueira mentioned this pull request Feb 7, 2026

Implementation idea: make a custom catalog provider alxmrs/xarray-sql#85

Open

Merge remote-tracking branch 'upstream/main' into gcs-benchmarks

6cfeafc

Conversation

xaviernogueira commented Jan 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

xaviernogueira commented Jan 22, 2026

Uh oh!

maximedion2 commented Jan 22, 2026

Uh oh!

xaviernogueira commented Jan 22, 2026

Uh oh!

alamb commented Jan 22, 2026 via email

Uh oh!

xaviernogueira commented Jan 22, 2026

Uh oh!

alamb commented Jan 22, 2026

Uh oh!

maximedion2 commented Jan 23, 2026

Uh oh!

maximedion2 commented Jan 23, 2026

Uh oh!

xaviernogueira commented Jan 23, 2026

Uh oh!

maximedion2 commented Jan 23, 2026

Uh oh!

maximedion2 Jan 26, 2026

Choose a reason for hiding this comment

Uh oh!

maximedion2 Jan 26, 2026

Choose a reason for hiding this comment

Uh oh!

maximedion2 Jan 26, 2026

Choose a reason for hiding this comment

Uh oh!

maximedion2 commented Feb 4, 2026

Uh oh!

xaviernogueira commented Feb 4, 2026

Uh oh!

maximedion2 commented Feb 5, 2026

Uh oh!

jiayuasu commented Feb 18, 2026

Uh oh!

maximedion2 commented Feb 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

xaviernogueira commented Jan 21, 2026 •

edited

Loading