GCS Benchmarks#70
Conversation
|
Requesting review from contributors: @maximedion2 @kylebarron @tshauck @alamb |
|
Hey there! Thanks for the contribution! A few notes:
In any case, thanks for helping out, I have a lot of things I want to do with this project but not nearly as much time as I would like, happy to get some help on this! |
|
@maximedion2 Thanks for the quick response! I totally understand, ping me when your PR merges and I can go deal with conflicts here. This was half a rust learning exercise for me so merge or not I enjoyed the experience. Regarding the numbers I actually don't really know what it is relative to (hence the "?") as I only ran the GCS bench (don't have a AWS bucket set up but can do that a bit later) and I only ran it once...kinda odd! I'll be much more free to contribute in the spring as I'll be becoming a freelancer / leaving my full time due to relocation out of USA in April. Looking forward to properly diving in there, especially around raster<>vector stuff + additional dimensional. If you want, also feel free to make Issues that you don't have time for, I could use those as oppurtunities to jump in, as I know if can be a touch disruptive to receive un-requested PRs. |
|
Thanks for the request, but sadly I am not likely to have time to review
PRs in this repo
…On Thu, Jan 22, 2026 at 2:53 PM Xavier Nogueira ***@***.***> wrote:
*xaviernogueira* left a comment (datafusion-contrib/arrow-zarr#70)
<#70 (comment)>
Requesting review from contributors: @maximedion2
<https://github.com/maximedion2> @kylebarron
<https://github.com/kylebarron> @tshauck <https://github.com/tshauck>
@alamb <https://github.com/alamb>
—
Reply to this email directly, view it on GitHub
<#70 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AADXZMIPCDW6W4HWIZGPGC34IETEDAVCNFSM6AAAAACSLXO5ASVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZTOOBWGM4DKNJYGY>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
|
@alamb ha automated review request rejection that's awesome, didn't actually realize it was you from influxDB wouldn't have requested if I did I assume you are quite busy. |
|
Well it wans't quite automated -- I responded to an email github alert :) |
Okay I just merged my PR, I'll take a look at yours probably tomorrow. Ah, right, I don't have a GCS account, I'm more familiar with AWS so I set that up initially (and it's kind of why I was procrastinating about implementing GCS support haha) so it's good that you have it the other way around. I'd be curious about a side by side comparison between AWS and GCS though. Sounds good, I do have a couple good issues to work on, that are not necessarily urgent, I'll create those and you can pick them up if/when you have time (and of course feel free to ask any questions you'd like, happy to discuss). |
|
oh and @xaviernogueira I'm not completely sure what you mean by "raster <> vector stuff", but maybe related to that, part of my plan is to implement spatial functions specialized for points (e.g. a ST_Within where one side is always a point geo and the other is just any type of geo). SedonaDB already implements the traditional spatial stuff in datafusion I think, but I'd like to specialize this crate on raster data from zarr/icechunk, I have a few ideas regarding spatial operations for points, which may or may not be more fun than useful, only way to know is to try! |
|
@maximedion2 sweet, good stuff with the Python bindings. I should have some time this weekend to loop back around and get the merge conflicts resolved and address your review (if you do it by then...no rush!). And yes I really only use GCP although I may change that for my own stuff going forward as its kinda pricey. As by "raster<>vector stuff" yes I am referring to functionality currently in SedonaDB like ZonalStats, so plugging into that via a |
Ah I see. Yeah I was looking at SedonaDB recently, a lot of the stuff is pub so components can be re-used. I think rasters are a WIP, but once I see the internals of how it works in SedonaDB we could certainly add a "RS_fromZarr" or something like that. And it seems things are very modular in datafusion, so we could potentially spin up a SessionContext, register the RS_fromZarr, register whatever UDFs you need from SedonaDB (e.g. ZonalStats), and combine all that so that it's all available in the same query. And then of course we can add whatever custom functionality (ExecutionPlan, UDFs) on top of that. |
| // ============================================================================ | ||
|
|
||
| struct GCSBenchBackend { | ||
| _bucket: String, |
There was a problem hiding this comment.
nit: everything is private unless you say it's public in rust, no need for the underscores for class members.
| } | ||
|
|
||
| async fn cleanup(&self) { | ||
| // Cleanup is handled by the TestFixture Drop implementation |
There was a problem hiding this comment.
Not sure I follow... the test fixture calls this method when it's dropped, but you haven't implemented anything here (but you did for the s3 version)?
Generally speaking though, why do we need this, that gets called in the test fixture drop? why not simply implement drop on the gcs and s3 backends directly?
| @@ -176,6 +176,26 @@ impl ZarrTableUrl { | |||
| .await | |||
| .map_err(|e| DataFusionError::External(Box::new(e)))? | |||
| } | |||
| "gs" => { | |||
There was a problem hiding this comment.
can you also implement the zarr case? it's right above it in the code, it relies on ObjectStore (which has a gcs backend).
also, I was gonna ask why you have a use statement here, but I just saw that that's what I did too apparently... generally speaking I think that's probably not a good pattern, that was just me being sloppy.
but, this has made me realize that I need to properly make s3 and gcs optional, set up the right features, etc... So let's leave it as is for now, I will pick this up after we merge the PR.
|
@xaviernogueira were you planning on picking this up soon? No rush, just want to know when I can start working on making S3/GCS proper features in the crate. |
|
@maximedion2 busy with a sprint at my job right now and getting setup in a new country! If you want to pick it up feel free, I'll definitely circle back at some point soonish |
Ah I see, no problem I have a few other things that I want to work on that are not blocked by this, I'll get started on it and I'll circle back to this PR in a little while, if you're still busy I'll pick it up, but I'll give you more time to take a look if you want. |
|
Thanks for the great discussion. The raster capabilities in SedonaDB are indeed under active development, but they primarily reimplement the functionality already available in SedonaSpark: https://sedona.apache.org/latest/tutorial/raster/ SedonaSpark raster functions have been used by many users in production. Both SedonaDB and SedonaSpark adopt the PostGIS Raster data model, which allows us to remain compatible with PostGIS. |
Ah I see, no problem I have a few other things that I want to work on that are not blocked by this, I'll get started on it and I'll circle back to this PR in a little while, if you're still busy I'll pick it up, but I'll give you more time to take a look if you want.
Thanks for the note! I'll check in with @xaviernogueira, I have several things I want to work on with this project, some ideas I want to try, etc... and I think he was interested in the "raster from zarr" concept, so when he's more available maybe that could be a really cool project to work on. I'll check the docs/tutorial you mentioned in the meantime, I'm not super familiar with rasters so I can definitely use a primer! |
Contribution Context:
Super cool project you have here! I was privately working on my own version of
arrow-zarrbut withoutdatafusion(parquet focused) and with a lot less progress than you. Anyways I gave up on that and am looking to help build things here and be involved in this project, especially around raster<>vector UDFs over time.That said this is my first PR in rust, I am a bit of a beginner, so some of the functionality I'd like to get into is above my pay-grade at the moment. To help my learning I decided to go for some low hanging fruit here.
Changes:
s3_bench.rsto create aCloudStorageBenchBackendtrait inshared.rs-> reduces duplication as additional benchmark backends are added.Results (around 2x slower than s3?):