GCS Benchmarks#70

Open
xaviernogueira wants to merge 5 commits into
datafusion-contrib:mainfrom
xaviernogueira:gcs-benchmarks

Conversation

@xaviernogueira

@xaviernogueira xaviernogueira commented Jan 21, 2026

Contribution Context:
Super cool project you have here! I was privately working on my own version of arrow-zarr but without datafusion (parquet focused) and with a lot less progress than you. Anyways I gave up on that and am looking to help build things here and be involved in this project, especially around raster<>vector UDFs over time.

That said this is my first PR in rust, I am a bit of a beginner, so some of the functionality I'd like to get into is above my pay-grade at the moment. To help my learning I decided to go for some low hanging fruit here.

Changes:

  • Decomposed s3_bench.rs to create a CloudStorageBenchBackend trait in shared.rs -> reduces duplication as additional benchmark backends are added.
  • Implemented a GCS version of the S3 benchmark using said interface.
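A rough sketch of the shape such a shared trait might take. The method names (`name`, `table_url`) and the `bench_group_name` helper are hypothetical, not the actual contents of shared.rs, and the methods are synchronous here to keep the example std-only (the real trait appears to have at least one async method, `cleanup`).

```rust
// Hypothetical sketch: method names and signatures are assumptions,
// not the actual trait defined in shared.rs.
trait CloudStorageBenchBackend {
    /// Short label used to name the benchmark group.
    fn name(&self) -> &str;
    /// URL of a dataset inside this backend's bucket.
    fn table_url(&self, dataset: &str) -> String;
}

struct S3BenchBackend {
    bucket: String,
}

struct GCSBenchBackend {
    bucket: String,
}

impl CloudStorageBenchBackend for S3BenchBackend {
    fn name(&self) -> &str {
        "s3"
    }
    fn table_url(&self, dataset: &str) -> String {
        format!("s3://{}/{}", self.bucket, dataset)
    }
}

impl CloudStorageBenchBackend for GCSBenchBackend {
    fn name(&self) -> &str {
        "gcs"
    }
    fn table_url(&self, dataset: &str) -> String {
        format!("gs://{}/{}", self.bucket, dataset)
    }
}

// Shared benchmark logic is written once against the trait.
fn bench_group_name(backend: &dyn CloudStorageBenchBackend) -> String {
    format!("{}_benchmarks", backend.name())
}

fn main() {
    let backends: Vec<Box<dyn CloudStorageBenchBackend>> = vec![
        Box::new(S3BenchBackend { bucket: "my-bucket".to_string() }),
        Box::new(GCSBenchBackend { bucket: "my-bucket".to_string() }),
    ];
    for backend in &backends {
        println!(
            "{} -> {}",
            bench_group_name(backend.as_ref()),
            backend.table_url("data.zarr")
        );
    }
}
```

Adding another backend would then only require one more `impl` block, while the benchmark bodies stay untouched.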

Results (around 2x slower than s3?):

gcs_benchmarks/join_benchmark
  time:   [16.424 s 17.994 s 19.681 s]
  change: [+96.419% +119.81% +145.68%] (p = 0.00 < 0.05)
  Performance has regressed.

gcs_benchmarks/union_benchmark
  time:   [11.741 s 11.878 s 12.022 s]
  change: [+88.345% +101.73% +117.00%] (p = 0.00 < 0.05)
  Performance has regressed.
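For reading the `change` lines: assuming the reported change is relative to a previously saved estimate, i.e. change = (new - old) / old, the earlier timing can be recovered as new / (1 + change). A small sketch using the point estimates (middle values) above; the function name `implied_baseline` is made up for illustration.

```rust
/// Recover the earlier estimate from a new estimate and a relative change,
/// assuming change = (new - old) / old.
fn implied_baseline(new_secs: f64, change_pct: f64) -> f64 {
    new_secs / (1.0 + change_pct / 100.0)
}

fn main() {
    // Point estimates taken from the benchmark output above.
    let join = implied_baseline(17.994, 119.81);
    let union = implied_baseline(11.878, 101.73);
    println!("join baseline  ~ {:.2} s", join);
    println!("union baseline ~ {:.2} s", union);
}
```

Under that assumption the earlier estimates come out at roughly 8.2 s and 5.9 s, which is where the "around 2x" figure comes from.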

@xaviernogueira xaviernogueira marked this pull request as ready for review January 22, 2026 19:52
@xaviernogueira
Author

Requesting review from contributors: @maximedion2 @kylebarron @tshauck @alamb

@maximedion2 maximedion2 self-requested a review January 22, 2026 20:03
@maximedion2
Collaborator

Hey there! Thanks for the contribution! A few notes:

  • I'm the only one really making contributions right now (although there's been some good discussions with @alxmrs who's also working on something similar), Andrew Lamb just created the repo initially since it's a datafusion contrib (tbh I might move this to somewhere else, like my own repo, I'm not sure it should be a datafusion contrib anymore, plus I want to change the name haha), and Kyle and Trent have previously made contributions, but I'll be the one reviewing the PR.
  • We're going to have to wait a little bit to merge this, I have a PR with python bindings that I will be merging soon, unfortunately it modifies the repo structure to create separate crates. Shouldn't be too bad, we can discuss how to update this PR after I merge mine.
  • Re: the numbers you are seeing, I remember the Arraylake guys saying that they did a specialized S3 implementation for Icechunk, and that it was faster than the out-of-the-box S3 stuff. Was your bench for an Icechunk repo?

In any case, thanks for helping out, I have a lot of things I want to do with this project but not nearly as much time as I would like, happy to get some help on this!

@xaviernogueira
Author

@maximedion2 Thanks for the quick response! I totally understand, ping me when your PR merges and I can go deal with conflicts here. This was half a rust learning exercise for me so merge or not I enjoyed the experience.

Regarding the numbers, I actually don't really know what they are relative to (hence the "?"), as I only ran the GCS bench (I don't have an AWS bucket set up but can do that a bit later) and I only ran it once... kinda odd!

I'll be much more free to contribute in the spring as I'll be becoming a freelancer / leaving my full time due to relocation out of USA in April. Looking forward to properly diving in there, especially around raster<>vector stuff + additional dimensional.

If you want, also feel free to create Issues for things you don't have time for; I could use those as opportunities to jump in, as I know it can be a touch disruptive to receive unrequested PRs.

@alamb
Contributor

alamb commented Jan 22, 2026 via email

@xaviernogueira
Author

@alamb ha, an automated review-request rejection, that's awesome. I didn't actually realize it was you from InfluxDB; I wouldn't have requested if I had, as I assume you are quite busy.

@alamb
Contributor

alamb commented Jan 22, 2026

Well, it wasn't quite automated -- I responded to an email GitHub alert :)

@maximedion2
Collaborator

Okay I just merged my PR, I'll take a look at yours probably tomorrow.

Ah, right, I don't have a GCS account, I'm more familiar with AWS so I set that up initially (and it's kind of why I was procrastinating about implementing GCS support haha) so it's good that you have it the other way around. I'd be curious about a side by side comparison between AWS and GCS though.

Sounds good, I do have a couple good issues to work on, that are not necessarily urgent, I'll create those and you can pick them up if/when you have time (and of course feel free to ask any questions you'd like, happy to discuss).

@maximedion2
Collaborator

oh and @xaviernogueira I'm not completely sure what you mean by "raster <> vector stuff", but maybe related to that, part of my plan is to implement spatial functions specialized for points (e.g. a ST_Within where one side is always a point geo and the other is just any type of geo). SedonaDB already implements the traditional spatial stuff in datafusion I think, but I'd like to specialize this crate on raster data from zarr/icechunk, I have a few ideas regarding spatial operations for points, which may or may not be more fun than useful, only way to know is to try!

@xaviernogueira
Author

@maximedion2 sweet, good stuff with the Python bindings. I should have some time this weekend to loop back around and get the merge conflicts resolved and address your review (if you do it by then...no rush!).

And yes, I really only use GCP, although I may change that for my own stuff going forward as it's kinda pricey.

As for "raster<>vector stuff", yes, I am referring to functionality currently in SedonaDB like ZonalStats, so plugging into that via a Raster constructor (probably closest to RS_fromNetCDF). That said, I am also referring to functionality that is missing from SedonaDB (that I am familiar with via ArcGIS), but working on those is more of a longer-term goal, and would likely be SedonaDB contributions anyway, so out of scope for this convo. My thoughts here are a bit half-baked and evolving tbh, and should come together more in the spring when I can focus more fully on OSS vs the stuff I am doing at my job currently.

@maximedion2
Collaborator

Ah I see. Yeah I was looking at SedonaDB recently, a lot of the stuff is pub so components can be re-used. I think rasters are a WIP, but once I see the internals of how it works in SedonaDB we could certainly add a "RS_fromZarr" or something like that. And it seems things are very modular in datafusion, so we could potentially spin up a SessionContext, register the RS_fromZarr, register whatever UDFs you need from SedonaDB (e.g. ZonalStats), and combine all that so that it's all available in the same query. And then of course we can add whatever custom functionality (ExecutionPlan, UDFs) on top of that.

Comment thread benches/gcs_bench.rs Outdated
// ============================================================================

struct GCSBenchBackend {
    _bucket: String,
Collaborator

nit: everything is private in Rust unless you mark it pub, so there's no need for the underscores on struct fields.
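To illustrate the point: struct fields are private to their defining module by default, with or without a leading underscore. As far as I know, the underscore mainly serves to silence unused-field warnings. A minimal sketch (the `backend` module, `new`, and the `bucket()` accessor are made up for illustration):

```rust
mod backend {
    pub struct GCSBenchBackend {
        // No `pub` here, so the field is private to the `backend` module.
        bucket: String,
    }

    impl GCSBenchBackend {
        pub fn new(bucket: &str) -> Self {
            Self { bucket: bucket.to_string() }
        }

        // Read-only access for code outside the module.
        pub fn bucket(&self) -> &str {
            &self.bucket
        }
    }
}

fn main() {
    let b = backend::GCSBenchBackend::new("my-bucket");
    // Writing `b.bucket` here would not compile: the field is private.
    println!("{}", b.bucket());
}
```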

Comment thread benches/gcs_bench.rs Outdated
}

async fn cleanup(&self) {
    // Cleanup is handled by the TestFixture Drop implementation
Collaborator

Not sure I follow... the test fixture calls this method when it's dropped, but you haven't implemented anything here (though you did for the S3 version)?
Generally speaking though, why do we need a method that gets called in the test fixture's Drop? Why not simply implement Drop on the GCS and S3 backends directly?
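A minimal sketch of that second suggestion, assuming cleanup can run synchronously. The `deleted` log stands in for real object deletion, and everything except the `GCSBenchBackend` name is hypothetical. Note that `Drop::drop` cannot be async, so genuinely async cleanup would have to be bridged to a runtime somehow.

```rust
use std::cell::RefCell;
use std::rc::Rc;

struct GCSBenchBackend {
    bucket: String,
    // Stand-in for the remote store: records what was cleaned up.
    deleted: Rc<RefCell<Vec<String>>>,
}

impl Drop for GCSBenchBackend {
    fn drop(&mut self) {
        // Runs automatically when the backend goes out of scope.
        self.deleted
            .borrow_mut()
            .push(format!("gs://{}/bench-data", self.bucket));
    }
}

fn run_bench(deleted: Rc<RefCell<Vec<String>>>) {
    let _backend = GCSBenchBackend {
        bucket: "my-bucket".to_string(),
        deleted,
    };
    // ... the benchmark body would go here ...
} // `_backend` is dropped here, which triggers the cleanup.

fn main() {
    let log = Rc::new(RefCell::new(Vec::new()));
    run_bench(Rc::clone(&log));
    println!("cleaned up: {:?}", log.borrow());
}
```

With this shape the fixture no longer needs to know about cleanup at all; each backend owns its own teardown.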

@@ -176,6 +176,26 @@ impl ZarrTableUrl {
.await
.map_err(|e| DataFusionError::External(Box::new(e)))?
}
"gs" => {
Collaborator

can you also implement the zarr case? it's right above it in the code, it relies on ObjectStore (which has a gcs backend).

also, I was gonna ask why you have a use statement here, but I just saw that that's what I did too apparently... generally speaking I think that's probably not a good pattern, that was just me being sloppy.

but, this has made me realize that I need to properly make s3 and gcs optional, set up the right features, etc... So let's leave it as is for now, I will pick this up after we merge the PR.
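One way the optional backends could be gated, as a rough sketch. The feature names `s3` and `gcs` and the `backend_for_scheme` function are assumptions, not the crate's actual setup, and the sketch presumes matching entries under `[features]` in Cargo.toml.

```rust
/// Map a URL scheme to a backend label. Each cloud arm is compiled in only
/// when its (hypothetical) Cargo feature is enabled.
fn backend_for_scheme(scheme: &str) -> Result<&'static str, String> {
    match scheme {
        "file" => Ok("local"),
        #[cfg(feature = "s3")]
        "s3" => Ok("s3"),
        #[cfg(feature = "gcs")]
        "gs" => Ok("gcs"),
        // Unknown schemes and schemes whose feature is off both end up here.
        other => Err(format!("unsupported or disabled scheme: {other}")),
    }
}

fn main() {
    for scheme in ["file", "s3", "gs"] {
        println!("{scheme}: {:?}", backend_for_scheme(scheme));
    }
}
```

Built without either feature, the `s3` and `gs` schemes fall through to the error arm, so users who only need local stores would not pull in the cloud dependencies.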

@maximedion2 maximedion2 linked an issue Feb 4, 2026 that may be closed by this pull request
@maximedion2
Collaborator

@xaviernogueira were you planning on picking this up soon? No rush, just want to know when I can start working on making S3/GCS proper features in the crate.

@xaviernogueira
Author

@maximedion2 busy with a sprint at my job right now and getting set up in a new country! If you want to pick it up, feel free; I'll definitely circle back at some point soonish.

@maximedion2
Collaborator

Ah I see, no problem I have a few other things that I want to work on that are not blocked by this, I'll get started on it and I'll circle back to this PR in a little while, if you're still busy I'll pick it up, but I'll give you more time to take a look if you want.

@jiayuasu

Thanks for the great discussion.

The raster capabilities in SedonaDB are indeed under active development, but they primarily reimplement the functionality already available in SedonaSpark: https://sedona.apache.org/latest/tutorial/raster/ SedonaSpark raster functions have been used by many users in production.

Both SedonaDB and SedonaSpark adopt the PostGIS Raster data model, which allows us to remain compatible with PostGIS.

@maximedion2
Collaborator

Thanks for the note! I'll check in with @xaviernogueira, I have several things I want to work on with this project, some ideas I want to try, etc... and I think he was interested in the "raster from zarr" concept, so when he's more available maybe that could be a really cool project to work on. I'll check the docs/tutorial you mentioned in the meantime, I'm not super familiar with rasters so I can definitely use a primer!

Development

Successfully merging this pull request may close these issues.

Add support for GCS

4 participants