STAC DataPipes #48

weiji14 · 2022-09-02T19:07:56Z

weiji14
Sep 2, 2022
Maintainer

The goal is to enable cloud-native, streaming machine learning data pipelines based on SpatioTemporal Asset Catalogs (STAC)!

A torch DataPipe is a way of doing composition over inheritance. The philosophy is to have each DataPipe do one thing and do it well, similar to the UNIX philosophy of piping the text output of one command into another command. The pipe syntax also has parallels with method chaining in pandas (see pandas.DataFrame.pipe).
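To make the "composition over inheritance" idea concrete, here is a minimal sketch in plain Python. The `Source`, `Filter` and `Mapper` classes are hypothetical stand-ins I made up for illustration; they are not torchdata or zen3geo classes, just small iterables that each do one thing and wrap the previous one.

```python
from typing import Any, Callable, Iterable, Iterator


class Source:
    """Start of a pipeline: yields items from any iterable."""

    def __init__(self, items: Iterable[Any]) -> None:
        self.items = items

    def __iter__(self) -> Iterator[Any]:
        yield from self.items


class Filter:
    """Keeps only the upstream items for which the predicate is true."""

    def __init__(self, upstream: Iterable[Any], predicate: Callable[[Any], bool]) -> None:
        self.upstream = upstream
        self.predicate = predicate

    def __iter__(self) -> Iterator[Any]:
        for item in self.upstream:
            if self.predicate(item):
                yield item


class Mapper:
    """Applies one function to every item coming from the upstream pipe."""

    def __init__(self, upstream: Iterable[Any], fn: Callable[[Any], Any]) -> None:
        self.upstream = upstream
        self.fn = fn

    def __iter__(self) -> Iterator[Any]:
        for item in self.upstream:
            yield self.fn(item)


# Compose by wrapping one pipe in another, rather than by subclassing.
names = Source(["a.json", "b.txt", "c.json"])
json_only = Filter(names, lambda name: name.endswith(".json"))
upper = Mapper(json_only, str.upper)
result = list(upper)  # nothing is computed until the pipeline is iterated
```

Each stage only knows about the iterable it wraps, so stages can be swapped or reordered without touching the others.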

📖 STAC Readers

The STAC specification has 4 parts (see https://stacspec.org/en/about/stac-spec), and one idea is to have an individual DataPipe for each of the STAC Item/Catalog/Collection/API, as hinted in torchgeo/torchgeo#412 (comment):

PySTACItemReader wrapping pystac.Item.from_file (✨ PySTACItemReaderIterDataPipe for reading STAC Items #46)
PySTACCatalogReader wrapping pystac.Catalog.from_file for static catalogs.
PySTACCollectionReader wrapping pystac.ItemCollection.from_file
PySTACAPISearcher wrapping e.g. pystac_client.Client.search for dynamic catalogs (✨ PySTACAPISearchIterDataPipe to query dynamic STAC Catalogs #59)

See also https://stacspec.org/en/about/stac-spec/
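As a rough sketch of what "a reader wrapping a `from_file` function" could look like: the `ItemReaderPipe` class and `fake_read` function below are hypothetical names of mine, not the zen3geo implementation. The reader function is passed in as an argument so the sketch runs offline without pystac installed.

```python
import posixpath
from typing import Any, Callable, Dict, Iterable, Iterator


class ItemReaderPipe:
    """Hypothetical reader: turns each href into an object via `read_fn`."""

    def __init__(self, hrefs: Iterable[str], read_fn: Callable[[str], Any]) -> None:
        self.hrefs = hrefs
        self.read_fn = read_fn

    def __iter__(self) -> Iterator[Any]:
        for href in self.hrefs:
            yield self.read_fn(href)


def fake_read(href: str) -> Dict[str, str]:
    """Stand-in for a real reader; builds a tiny dict from the file name only."""
    item_id = posixpath.splitext(posixpath.basename(href))[0]
    return {"type": "Feature", "id": item_id}


hrefs = ["https://example.com/items/scene-1.json"]  # made-up URL for illustration
items = list(ItemReaderPipe(hrefs, read_fn=fake_read))
```

With pystac installed, the same class should work with `read_fn=pystac.Item.from_file`, and similarly with `pystac.Catalog.from_file` or `pystac.ItemCollection.from_file` for the other readers listed above.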

💾 STAC I/O

Coming from the STAC Readers, the STAC objects (Item, ItemCollection, etc.) then need to be read into memory using some I/O library. These I/O libraries handle the stacking of Assets, as mentioned in torchgeo/torchgeo#412 (comment). E.g.:

StackstacStacker wrapping stackstac.stack which returns an xarray.DataArray (✨ StackSTACStackerIterDataPipe for stacking STAC items #61)
ODCstacLoader wrapping odc.stac.load which returns an xarray.Dataset

Note: See also opendatacube/odc-stac#54 (comment) for differences between stackstac and odc-stac

🐕‍🦺 STAC services (requiring authentication)

planetary_computer has their STAC catalog at https://planetarycomputer.microsoft.com/api/stac/v1/, and there are some (but not all) Collections which require signing/authentication
radiant-mlhub has their own STAC catalog/API library, as mentioned in Add STACAPI dataset torchgeo/torchgeo#412 (comment)

Note: The authentication/signing can be handled via the parameters and/or modifier arguments of pystac_client.Client.open (I think).
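For the Planetary Computer case, a minimal sketch of signing through the `modifier` argument might look like this. The `open_signed_catalog` function name is my own, and it assumes both `pystac-client` and `planetary-computer` are installed; the imports sit inside the function so the snippet still loads without them.

```python
def open_signed_catalog(url: str = "https://planetarycomputer.microsoft.com/api/stac/v1/"):
    """Open a STAC API client whose returned objects get signed in place."""
    # Imported lazily so this module still imports when the optional
    # packages are missing; they are only needed once the function is called.
    import planetary_computer
    import pystac_client

    # `modifier` is called on the results returned by the client, so asset
    # hrefs should come back already signed.
    return pystac_client.Client.open(url, modifier=planetary_computer.sign_inplace)
```

Calling `open_signed_catalog()` needs network access, so treat this as an untested outline rather than a verified recipe.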

🥤 Example 'DataPipeLine'

graph LR
    subgraph STAC DataPipeLine 1

    A["IterableWrapper (list[url])"] --> B
    B["PySTACItemReader (list[pystac.Item])"] --> C
    C["StackstacStacker (xarray.DataArray)"]

    end

🧑‍🤝‍🧑 Open for contributions

Anyone is welcome to comment on the details (e.g. naming the DataPipes, what else is needed, etc), or open a Pull Request directly to implement a DataPipe (see https://zen3geo.readthedocs.io/en/latest/CONTRIBUTING.html#running-things-locally on getting started)!

One thing to note is that I've designed zen3geo explicitly so that dependencies are optional by default, so if someone doesn't use odc-stac for example, they shouldn't have to install it. Just bear this in mind when you're writing up the code.
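One common way to keep a dependency optional is to attempt the import up front and only raise an error when the feature is actually used. The sketch below shows that general pattern; the `optional_import` and `load_with_odc` helpers are hypothetical names, and this is not necessarily how zen3geo does it internally.

```python
import importlib
from types import ModuleType
from typing import Any, Optional


def optional_import(name: str) -> Optional[ModuleType]:
    """Return the module if it is installed, otherwise None (no exception)."""
    try:
        return importlib.import_module(name)
    except ImportError:
        return None


# odc-stac may or may not be installed; importing this file works either way.
odc_stac = optional_import("odc.stac")


def load_with_odc(items: Any, **kwargs: Any) -> Any:
    """Fail with a helpful message only when the optional feature is used."""
    if odc_stac is None:
        raise ModuleNotFoundError(
            "odc-stac is needed for this DataPipe, try `pip install odc-stac`"
        )
    return odc_stac.load(items, **kwargs)
```

This way someone who only uses stackstac never hits an import error for odc-stac.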

Cc @jamesvrt, @rbavery, @KennSmithDS

Originally discussed in torchgeo/torchgeo#412, xref torchgeo/torchgeo#576

weiji14 · 2022-09-21T03:19:19Z

weiji14
Sep 21, 2022
Maintainer Author

Heads up that there's now an MVP (minimal viable pipeline) from STAC API queries to a stackstac-created xarray.DataArray 🎉! I'm drafting a STAC DataPipe walkthrough tutorial at #62 (rendered preview at https://zen3geo--62.org.readthedocs.build/en/62/stacking.html). This will cover a DataPipeLine like so:

graph LR
    subgraph STAC DataPipeLine

    A["IterableWrapper (list[dict])"] --> B
    B["PySTACAPISearcher (list[pystac_client.ItemSearch])"] --> C
    C["Mapper (list[pystac.ItemCollection])"] --> D
    D["StackstacStacker (list[xarray.DataArray])"]

    end

where the steps are:

STAC queries written in the form of a Python dict
Each dict query gets sent to pystac_client.Client.search, which returns a pystac_client.ItemSearch instance
The pystac_client.ItemSearch gets turned into a pystac.ItemCollection
The pystac.ItemCollection object is passed to stackstac.stack, which returns an xarray.DataArray
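The four steps above can be sketched by calling the underlying libraries directly, i.e. without the zen3geo DataPipes. The `build_query` and `query_to_dataarray` names are hypothetical, and the bounding box, date range and collection id are illustrative values I picked, not ones from the tutorial. It assumes `pystac-client` and `stackstac` are installed when `query_to_dataarray` is called.

```python
from typing import Any, Dict, List


def build_query(bbox: List[float], datetime: str, collections: List[str]) -> Dict[str, Any]:
    """Step 1: a STAC query written as a plain Python dict."""
    return {"bbox": bbox, "datetime": datetime, "collections": collections}


def query_to_dataarray(catalog_url: str, query: Dict[str, Any], **stack_kwargs: Any) -> Any:
    """Steps 2 to 4: dict query -> ItemSearch -> ItemCollection -> DataArray."""
    # Imported lazily, since both packages are optional and fairly heavy.
    import pystac_client
    import stackstac

    catalog = pystac_client.Client.open(catalog_url)
    search = catalog.search(**query)  # step 2: pystac_client.ItemSearch
    items = search.item_collection()  # step 3: pystac.ItemCollection
    return stackstac.stack(items, **stack_kwargs)  # step 4: xarray.DataArray


query = build_query(
    bbox=[103.60, 1.20, 104.10, 1.50],  # illustrative bounding box (assumption)
    datetime="2022-01-01/2022-01-31",  # illustrative date range (assumption)
    collections=["sentinel-2-l2a"],  # illustrative collection id (assumption)
)
```

In the DataPipe version, each of these calls becomes its own stage, with the Mapper covering step 3.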

Hoping to finish this by the end of the week 🤞, and will cut a new v0.5.0 release soon after 😁

3 replies

weiji14 Sep 26, 2022
Maintainer Author

STAC DataPipes are out in v0.5.0! See #68 for changelog 🎉

rbavery Sep 26, 2022

I'll be testing this out and comparing it to a workflow where I'm using stackstac and xbatcher directly. Thanks a bunch for this @weiji14 !

weiji14 Sep 26, 2022
Maintainer Author

Cool, let me know if there are any bugs. I'm gonna start to convert one of my projects to use this too 😀

weiji14 · 2023-04-27T02:01:08Z

weiji14
Apr 27, 2023
Maintainer Author

Note that zen3geo v0.6.0 comes with an XpySTACAssetReader DataPipe for reading STAC assets backed by COG/NetCDF/Zarr files, done in #87. This is essentially a wrapper around xarray.open_dataset(..., engine="stac") (requires xpystac to be installed), and enables reading a pystac.Asset directly into an xarray.Dataset object!

0 replies
