feat(python/sedonadb): Enable zarr read via sedonadb-zarr#916
Conversation
48fb34d to
ebfdeb2
Compare
There was a problem hiding this comment.
Pull request overview
Adds a Python-facing raster loader interface so OutDb rasters (e.g., Zarr chunk anchors) can be materialized by loaders implemented outside the core package, and updates the Rust async raster loader API to support batching.
Changes:
- Change
AsyncRasterLoader::loadto accept a batch of requests and return a batch of results. - Add a Python-backed raster loader bridge in
python/sedonadband register it viaSedonaContext.register(...). - Register a Zarr raster loader from
sedonadb_zarrand extend tests to coverRS_EnsureLoadedover Zarr data.
Reviewed changes
Copilot reviewed 18 out of 19 changed files in this pull request and generated 16 comments.
Show a summary per file
| File | Description |
|---|---|
| rust/sedona/src/context.rs | Updates test loader to the new batched raster loader API. |
| rust/sedona-raster/src/raster_loader.rs | Changes raster loader trait to batched load and updates registry tests. |
| rust/sedona-raster-zarr/src/raster_loader.rs | Adapts Zarr loader to batched load by delegating per-request. |
| rust/sedona-raster-gdal/src/raster_loader.rs | Adapts GDAL loader to batched load and refactors request validation/loading. |
| rust/sedona-raster-functions/src/rs_ensure_loaded.rs | Calls raster loader via new batched API and validates result count. |
| python/sedonadb/src/raster_loader.rs | Introduces Rust↔Python bridge implementing AsyncRasterLoader via PyO3. |
| python/sedonadb/src/lib.rs | Exposes new PyO3 raster-loader types and factory function to Python. |
| python/sedonadb/src/context.rs | Enables sd.register(...) to accept Python raster loaders/wrappers. |
| python/sedonadb/python/sedonadb/raster_loader.py | Adds the Python-side RasterLoader interface + result helpers. |
| python/sedonadb/python/sedonadb/context.py | Includes raster loader interface in supported registration interfaces. |
| python/sedonadb/tests/test_raster_loader.py | Adds tests for Python-backed raster loader registration/invocation. |
| python/sedonadb/Cargo.toml | Adds Rust deps needed for Python raster loader bridge (arrow-buffer, bytes, sedona-raster). |
| python/sedonadb-zarr/src/lib.rs | Adds a Python-exposed Zarr raster loader and zero-copy buffer wrapper. |
| python/sedonadb-zarr/python/sedonadb_zarr/init.py | Registers Zarr raster loader as part of ZarrExtension. |
| python/sedonadb-zarr/tests/test_zarr.py | Adds coverage for RS_EnsureLoaded end-to-end over Zarr. |
| python/sedonadb-zarr/tests/test_zarr_raster_functions.py | Removes xfail and asserts Zarr chunks are materialized. |
| python/sedonadb-zarr/tests/conftest.py | Minor formatting/spacing adjustments in fixtures. |
| python/sedonadb-zarr/Cargo.toml | Adds deps for the new Python-exposed loader/buffer types. |
| Cargo.lock | Locks new Rust dependencies pulled in by the Python crates. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
james-willis
left a comment
There was a problem hiding this comment.
Couple small things inline! Some questions below to help my understanding
2 things im not totally following:
-
why do we need so many py wrappers? If we went the ABI route would it reduce the amount of pythoning that goes on here? But this one lets users implement extensions in python? Is there perf overhead to this approach vs the FFI/ABI approach?
-
What is the advantage of grouping requests?
| if !buffer.is_c_contiguous() { | ||
| return Err(ArrowError::InvalidArgumentError( | ||
| "Buffer must be C-contiguous".to_string(), | ||
| )); | ||
| } |
There was a problem hiding this comment.
should we also check buffer.readonly()? doc string says we dont allow writable buffers
There was a problem hiding this comment.
I removed the comment. It doesn't matter if the buffer happens to be writable (Python does bookkeeping to ensure buffers are not mutated while references are held)
| ], | ||
| ) | ||
| def test_rs_ensure_loaded_with_zarr(tmp_path, numpy_dtype): | ||
| zarr = pytest.importorskip("zarr", minversion="3.0") |
There was a problem hiding this comment.
This should fail if it cant import
There was a problem hiding this comment.
I think wheel tests for python <3.11 will fail if this isn't 'skip'
There was a problem hiding this comment.
can we condition the import on pythong version then? I am afraid of silent failure
There was a problem hiding this comment.
I updated this, hopefully better now!
|
|
||
| // Spawn a blocking task to hold the GIL and call Python | ||
| let results = tokio::task::spawn_blocking(move || { | ||
| Python::with_gil(|py| -> Result<Vec<PyRasterLoadResultData>, ArrowError> { |
There was a problem hiding this comment.
will gil contention impede perf here?
There was a problem hiding this comment.
It will, which is why we can/will move this to FFI for the next release. As long as RS_EnsureLoaded is grouping these requests properly it will be fine in most cases. The implementation on the other side calls allow_threads, which allows other Python code to run while it's waiting for IO.
|
2 things im not totally following:
We can and will move this to an ABI; however, I tried briefly and I couldn't do a good job of it right away so I punted to the next release so we can spend the time it takes to get this right. In the meantime, having this in Python lets us/llms prototype and iterate quickly to try stuff
RS_EnsureLoaded knows out of the gate all of the requests it is going to issue. If the loader implementation can see all of them it can be smarter about the order in which it issues the requests (e.g., three bands from the same tiff don't have to decode a tiff 3 times). This reduces or eliminates the need for a cache in a lot of cases. There are also things that some loaders have to do per request that are expensive, such as holding the GIL or spawning a task/thread. If they have to do this for every element this is expensive; if they can do it every n elements, that cost is amortized. Exactly how many elements that is probably depends on what happens next...right now we're working in batches of 1024, which will probably cause out-of-memory (ideally you'd group by the number of bytes, do what you have to do with the loaded arrays, and then drop them before fetching the next batch). |
This PR allows actually loading pixels from backends that live in another Python package (notably: sedonadb_zarr). We had discussed in previous PRs implementing an FFI for this, but for now this PR just implements a Python interface on both sides and handles passing bytes via Python buffers.
There is also a tweak to the async load request to allow multiple requests to be issued at once, which works better for backends that have to launch a task (and is more realistic...usually the requests come in batches). For reasonable performance, RS_EnsureLoaded would need to be updated to take advantage of issuing more than one request at a time. That is not implemented here.
Example with zarrs:
And we can look at it!