Skip to content

feat(python/sedonadb): Enable zarr read via sedonadb-zarr#916

Merged
paleolimbot merged 38 commits into
apache:mainfrom
paleolimbot:raster-loader-plugin2
Jun 19, 2026
Merged

feat(python/sedonadb): Enable zarr read via sedonadb-zarr#916
paleolimbot merged 38 commits into
apache:mainfrom
paleolimbot:raster-loader-plugin2

Conversation

@paleolimbot

@paleolimbot paleolimbot commented Jun 6, 2026

Copy link
Copy Markdown
Member

This PR allows actually loading pixels from backends that live in another Python package (notably: sedonadb_zarr). We had discussed in previous PRs implementing an FFI for this, but for now this PR just implements a Python interface on both sides and handles passing bytes via Python buffers.

There is also a tweak to the async load request to allow multiple requests to be issued at once, which works better for backends that have to launch a task (and is more realistic...usually the requests come in batches). For reasonable performance, RS_EnsureLoaded would need to be updated to take advantage of issuing more than one request at a time. That is not implemented here.

Example with zarrs:

import matplotlib.pyplot as plt
import numpy as np

import sedona.db
import sedonadb_zarr

sd = sedona.db.connect()
sd.options.interactive = True
sd.register(sedonadb_zarr.ZarrExtension())

url = "https://atlantis-vis-o.s3-ext.jc.rl.ac.uk/hurricanes/era5/florence"
t = sd.read(url, format="zarr", options={"arrays": ["velocity"]})

tab = t.limit(1).select(
    raster=t.raster.funcs.rs_ensureloaded(),
    env=t.raster.funcs.rs_envelope().geo.set_srid(4326)
).to_arrow_table()

And we can look at it!

r = tab["raster"][0].as_py()
data = r.bands[0].to_numpy()
env = tab["env"][0].to_shapely()

# Squeeze extra dimensions to get 2D array
data_2d = np.squeeze(data)

fig, ax = plt.subplots(figsize=(10, 8))
im = ax.imshow(data_2d, extent=[env.bounds[0], env.bounds[2], env.bounds[1], env.bounds[3]], origin='lower')
ax.set_xlabel("Longitude")
ax.set_ylabel("Latitude")
ax.set_title("Velocity Raster")
plt.colorbar(im, ax=ax, label="Velocity")
plt.show()
output

@github-actions github-actions Bot requested a review from zhangfengcdt June 6, 2026 03:12

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a Python-facing raster loader interface so OutDb rasters (e.g., Zarr chunk anchors) can be materialized by loaders implemented outside the core package, and updates the Rust async raster loader API to support batching.

Changes:

  • Change AsyncRasterLoader::load to accept a batch of requests and return a batch of results.
  • Add a Python-backed raster loader bridge in python/sedonadb and register it via SedonaContext.register(...).
  • Register a Zarr raster loader from sedonadb_zarr and extend tests to cover RS_EnsureLoaded over Zarr data.

Reviewed changes

Copilot reviewed 18 out of 19 changed files in this pull request and generated 16 comments.

Show a summary per file
File Description
rust/sedona/src/context.rs Updates test loader to the new batched raster loader API.
rust/sedona-raster/src/raster_loader.rs Changes raster loader trait to batched load and updates registry tests.
rust/sedona-raster-zarr/src/raster_loader.rs Adapts Zarr loader to batched load by delegating per-request.
rust/sedona-raster-gdal/src/raster_loader.rs Adapts GDAL loader to batched load and refactors request validation/loading.
rust/sedona-raster-functions/src/rs_ensure_loaded.rs Calls raster loader via new batched API and validates result count.
python/sedonadb/src/raster_loader.rs Introduces Rust↔Python bridge implementing AsyncRasterLoader via PyO3.
python/sedonadb/src/lib.rs Exposes new PyO3 raster-loader types and factory function to Python.
python/sedonadb/src/context.rs Enables sd.register(...) to accept Python raster loaders/wrappers.
python/sedonadb/python/sedonadb/raster_loader.py Adds the Python-side RasterLoader interface + result helpers.
python/sedonadb/python/sedonadb/context.py Includes raster loader interface in supported registration interfaces.
python/sedonadb/tests/test_raster_loader.py Adds tests for Python-backed raster loader registration/invocation.
python/sedonadb/Cargo.toml Adds Rust deps needed for Python raster loader bridge (arrow-buffer, bytes, sedona-raster).
python/sedonadb-zarr/src/lib.rs Adds a Python-exposed Zarr raster loader and zero-copy buffer wrapper.
python/sedonadb-zarr/python/sedonadb_zarr/init.py Registers Zarr raster loader as part of ZarrExtension.
python/sedonadb-zarr/tests/test_zarr.py Adds coverage for RS_EnsureLoaded end-to-end over Zarr.
python/sedonadb-zarr/tests/test_zarr_raster_functions.py Removes xfail and asserts Zarr chunks are materialized.
python/sedonadb-zarr/tests/conftest.py Minor formatting/spacing adjustments in fixtures.
python/sedonadb-zarr/Cargo.toml Adds deps for the new Python-exposed loader/buffer types.
Cargo.lock Locks new Rust dependencies pulled in by the Python crates.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread rust/sedona-raster/src/raster_loader.rs
Comment thread rust/sedona-raster-zarr/src/raster_loader.rs
Comment thread rust/sedona-raster-gdal/src/raster_loader.rs
Comment thread rust/sedona-raster-gdal/src/raster_loader.rs
Comment thread rust/sedona/src/context.rs
Comment thread python/sedonadb/src/raster_loader.rs
Comment thread rust/sedona-raster-functions/src/rs_ensure_loaded.rs
Comment thread python/sedonadb-zarr/python/sedonadb_zarr/__init__.py
Comment thread python/sedonadb-zarr/python/sedonadb_zarr/__init__.py
Comment thread rust/sedona-raster/src/raster_loader.rs
@paleolimbot paleolimbot marked this pull request as ready for review June 18, 2026 14:07

@james-willis james-willis left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couple small things inline! Some questions below to help my understanding

2 things im not totally following:

  1. why do we need so many py wrappers? If we went the ABI route would it reduce the amount of pythoning that goes on here? But this one lets users implement extensions in python? Is there perf overhead to this approach vs the FFI/ABI approach?

  2. What is the advantage of grouping requests?

Comment on lines +254 to +258
if !buffer.is_c_contiguous() {
return Err(ArrowError::InvalidArgumentError(
"Buffer must be C-contiguous".to_string(),
));
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we also check buffer.readonly()? doc string says we dont allow writable buffers

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed the comment. It doesn't matter if the buffer happens to be writable (Python does bookkeeping to ensure buffers are not mutated while references are held)

Comment thread python/sedonadb-zarr/tests/test_zarr.py Outdated
],
)
def test_rs_ensure_loaded_with_zarr(tmp_path, numpy_dtype):
zarr = pytest.importorskip("zarr", minversion="3.0")

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should fail if it cant import

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think wheel tests for python <3.11 will fail if this isn't 'skip'

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we condition the import on pythong version then? I am afraid of silent failure

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated this, hopefully better now!


// Spawn a blocking task to hold the GIL and call Python
let results = tokio::task::spawn_blocking(move || {
Python::with_gil(|py| -> Result<Vec<PyRasterLoadResultData>, ArrowError> {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will gil contention impede perf here?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will, which is why we can/will move this to FFI for the next release. As long as RS_EnsureLoaded is grouping these requests properly it will be fine in most cases. The implementation on the other side calls allow_threads, which allows other Python code to run while it's waiting for IO.

@paleolimbot

Copy link
Copy Markdown
Member Author

2 things im not totally following:

why do we need so many py wrappers? If we went the ABI route would it reduce the amount of pythoning that goes on here? But this one lets users implement extensions in python? Is there perf overhead to this approach vs the FFI/ABI approach?

We can and will move this to an ABI; however, I tried briefly and I couldn't do a good job of it right away so I punted to the next release so we can spend the time it takes to get this right. In the meantime, having this in Python lets us/llms prototype and iterate quickly to try stuff

What is the advantage of grouping requests?

RS_EnsureLoaded knows out of the gate all of the requests it is going to issue. If the loader implementation can see all of them it can be smarter about the order in which it issues the requests (e.g., three bands from the same tiff don't have to decode a tiff 3 times). This reduces or eliminates the need for a cache in a lot of cases.

There are also things that some loaders have to do per request that are expensive, such as holding the GIL or spawning a task/thread. If they have to do this for every element this is expensive; if they can do it every n elements, that cost is amortized. Exactly how many elements that is probably depends on what happens next...right now we're working in batches of 1024, which will probably cause out-of-memory (ideally you'd group by the number of bytes, do what you have to do with the loaded arrays, and then drop them before fetching the next batch).

@paleolimbot paleolimbot added this to the 0.4.0 milestone Jun 18, 2026
@paleolimbot paleolimbot merged commit 2989112 into apache:main Jun 19, 2026
25 of 27 checks passed
@paleolimbot paleolimbot deleted the raster-loader-plugin2 branch June 19, 2026 02:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants