@danielfromearth
Collaborator

@danielfromearth danielfromearth commented Jan 8, 2025

This change adds a tutorial notebook, which shows how to work with TEMPO Level-3 data using the new earthaccess.open_virtual_mfdataset() functionality.
Addresses #903.

Pull Request (PR) draft checklist - click to expand
  • Please review our contributing documentation before getting started.
  • Populate a descriptive title. For example, instead of "Updated README.md", use a
    title such as "Add testing details to the contributor section of the README".
    Example PRs: #763
  • Populate the body of the pull request with:
  • Update CHANGELOG.md with details about your change in a section titled
    ## Unreleased. If such a section does not exist, please create one. Follow
    Common Changelog for your additions.
    Example PRs: #763
  • [n/a] Update the documentation and/or the README.md with details of changes to the
    earthaccess interface, if any. Consider new environment variables, function names,
    decorators, etc.

Click the "Ready for review" button at the bottom of the "Conversation" tab in GitHub
once these requirements are fulfilled. Don't worry if you see any test failures in
GitHub at this point!

Pull Request (PR) merge checklist - click to expand

Please do your best to complete these requirements! If you need help with any of these
requirements, you can ping the @nsidc/earthaccess-support team in a comment and we
will help you out!

  • [n/a] Add unit tests for any new features.
  • Apply formatting and linting autofixes. You can add a GitHub comment in this Pull
    Request containing "pre-commit.ci autofix" to automate this.
  • Ensure all automated PR checks (seen at the bottom of the "conversation" tab) pass.
  • Get at least one approving review.

📚 Documentation preview 📚: https://earthaccess--924.org.readthedocs.build/en/924/

@danielfromearth danielfromearth linked an issue Jan 8, 2025 that may be closed by this pull request
@github-actions

github-actions bot commented Jan 8, 2025

Binder 👈 Launch a binder notebook on this branch for commit 63bae71

I will automatically update this comment whenever this PR is modified

@danielfromearth danielfromearth added the "impact: documentation" label Jan 24, 2025
@danielfromearth danielfromearth self-assigned this Jan 24, 2025
@danielfromearth
Collaborator Author

danielfromearth commented Apr 4, 2025

Note: discussion of the data-related issue for the tutorial when using TEMPO Level-2 data has been happening over in zarr-developers/VirtualiZarr#487.

@danielfromearth
Collaborator Author

danielfromearth commented Apr 11, 2025

Alright, I've changed tactics for the short term because of the challenges (see the comments above) of opening Level-2 TEMPO data as a virtual dataset.

New notebook

While I still plan to work on that, I've created a new draft notebook that works well with Level-3 data from TEMPO. See it here. This notebook demonstrates working with a year's worth of TEMPO Level-3 data (4,867 granules), including:

  • loading several netCDF groups using earthaccess.open_virtual_mfdataset with load=True so they are indexed
  • merging the groups into one Dataset
  • computing spatial and temporal means for a subset of the data, and
  • creating plots of the results.
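
The steps listed above can be sketched roughly as follows. This is a minimal sketch, not the notebook's actual code: the group names ("product", "geolocation"), the coordinate names ("latitude", "longitude"), and the combine options are assumptions, and the helper names are hypothetical. Only earthaccess.open_virtual_mfdataset, its load and group parameters, and xarray.merge are real calls.

```python
"""Sketch: open several netCDF groups as indexed datasets, merge, then average."""

# Keyword arguments forwarded to earthaccess.open_virtual_mfdataset.
# load=True asks for a regular, lazily loaded xarray.Dataset with indexes;
# the remaining keys are assumed combine options passed through to xarray.
OPEN_OPTIONS = {
    "access": "indirect",
    "load": True,
    "concat_dim": "time",
    "coords": "minimal",
    "compat": "override",
    "combine_attrs": "override",
}

# None selects the root group; the other two group names are assumptions.
GROUPS = (None, "product", "geolocation")


def group_open_kwargs(group, **overrides):
    """Return the keyword arguments used to open one netCDF group."""
    return {**OPEN_OPTIONS, **overrides, "group": group}


def open_merged(granules, groups=GROUPS, **overrides):
    """Open each group across all granules and merge them into one Dataset."""
    # Imported here so the helpers above can be used without these packages.
    import earthaccess
    import xarray as xr

    datasets = [
        earthaccess.open_virtual_mfdataset(
            granules=granules, **group_open_kwargs(group, **overrides)
        )
        for group in groups
    ]
    return xr.merge(datasets)


def regional_time_mean(ds, variable, lat_bounds, lon_bounds):
    """Lazily compute the temporal mean of one variable over a lat/lon box.

    Assumes 1-D coordinates named "latitude" and "longitude" and a "time"
    dimension; nothing is read until .compute() or .plot() is called.
    """
    subset = ds[variable].sel(
        latitude=slice(*lat_bounds), longitude=slice(*lon_bounds)
    )
    return subset.mean(dim="time")
```

With granules returned by earthaccess.search_data(...) after earthaccess.login(), open_merged(granules) would give the merged Dataset, and regional_time_mean(...) followed by .plot() would produce a map like the ones in the notebook. The variable name to pass depends on the product and is not specified here.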

Timings

I've included %%time markings on many of the notebook cells to indicate the performance along the way.
The most substantial steps took:

  • ~8 min. to use earthaccess.open_virtual_mfdataset() for opening three netCDF groups from the year's worth of granules.
  • ~9 min. to compute the mean over time for a plot.

Would love to hear initial thoughts/comments/questions/suggestions on this!

@ayushnag, @betolink, @battistowx, @TomNicholas

@danielfromearth danielfromearth marked this pull request as ready for review April 15, 2025 15:02
@danielfromearth danielfromearth changed the title from "initial draft of TEMPO virtual dataset tutorial" to "Add notebook demonstrating workflow with TEMPO Level 3 data as a virtual dataset" Apr 15, 2025
@danielfromearth
Collaborator Author

danielfromearth commented Apr 15, 2025

I think this looks okay now, but could use other pairs of eyes on it. As part of reviewing this, I would appreciate any suggestions or comments — whether on content, formatting, presentation, etc!

@danielfromearth
Collaborator Author

I've shortened the notebook slightly by removing the %%time timings, since performance is being benchmarked in discussion #987 and PR #989.

@TomNicholas

IIUC, because you're passing load=True, at no point is this code currently creating a "virtual" dataset.

@danielfromearth
Collaborator Author

danielfromearth commented Apr 21, 2025

> IIUC, because you're passing load=True, at no point is this code currently creating a "virtual" dataset.

Hmmm, @TomNicholas, I think you are correct as far as what is returned from the earthaccess.open_virtual_mfdataset() function, but I think there is still a virtual dataset being created temporarily. The way I understand dmrpp_zarr.py: regardless of whether load=True or load=False, the code creates a virtual dataset as an intermediate step (line 112 in dmrpp_zarr.py) before converting it to kerchunk references and passing those to xarray in the load block (lines 129–131). Does that sound/look right?

Either way, this may be a place where there could be improvements in documentation to make this clearer.

@TomNicholas

I think that's right, yes. But that does mean that in this notebook the user never sees a virtual dataset. It's purely an internal optimization at opening-time by earthaccess.

FWIW this relates to my proposal that what earthaccess should be used for is to generate Icechunk stores of interest, such as one for all TEMPO Level 3 data (containing virtual chunks), that people then open directly. See #956. In that paradigm you use basically the same notebook to create an actually virtual dataset, commit that to Icechunk, then tell any users who want to use TEMPO Level 3 data to simply open that Icechunk store.

@ayushnag
Collaborator

ayushnag commented Apr 23, 2025

@TomNicholas @danielfromearth yes, the load=True param does create a temporary virtual reference file and then immediately loads it with kerchunk. This is because the function assumes the user is making this request for the first time and the combined manifest file needs to be generated first. However, if you already have the manifest, you can avoid all these steps entirely and just load with kerchunk/virtualizarr.

This was implemented before ManifestStore was added to virtualizarr, and ManifestStore should be a much cleaner way of loading the dataset. Basically we can get rid of the load param, since the user can easily access data once they have the virtual xarray dataset.
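
To make the "generate the manifest once, then just load it" idea concrete, here is a minimal sketch under stated assumptions. It is not the code in dmrpp_zarr.py: the helper names are hypothetical, the .virtualize accessor name is the one VirtualiZarr registered around the time of this thread (other releases may name it differently), and the remote protocol and storage options needed for authenticated Earthdata access are left as parameters rather than filled in.

```python
"""Sketch: keep the virtual dataset, save its references, reopen them later."""


def reference_backend_kwargs(refs, remote_protocol, remote_options=None):
    """Build backend_kwargs for opening kerchunk references via fsspec.

    refs may be an in-memory kerchunk reference dict or the path to a
    reference JSON file; both are accepted by fsspec's reference filesystem.
    """
    return {
        "consolidated": False,
        "storage_options": {
            "fo": refs,
            "remote_protocol": remote_protocol,
            "remote_options": dict(remote_options or {}),
        },
    }


def save_references(granules, path, **combine_kwargs):
    """Create a virtual dataset (load=False) and write it as kerchunk JSON."""
    # Imported here so the pure helper above works without earthaccess.
    import earthaccess

    vds = earthaccess.open_virtual_mfdataset(
        granules=granules, load=False, **combine_kwargs
    )
    # Assumption: VirtualiZarr exposes its writers on the .virtualize accessor.
    vds.virtualize.to_kerchunk(path, format="json")
    return vds


def open_references(path, remote_protocol="https", remote_options=None):
    """Open previously saved references as a lazily loaded xarray.Dataset."""
    import xarray as xr

    return xr.open_dataset(
        "reference://",
        engine="zarr",
        chunks={},
        backend_kwargs=reference_backend_kwargs(
            path, remote_protocol, remote_options
        ),
    )
```

In this pattern only save_references touches the per-granule metadata; later sessions call open_references on the saved file and skip the combine step. Reading the actual chunks would still need valid credentials passed through remote_options.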

Also agree with the point that really the best way of doing this is some kind of combined Icechunk store for each collection that is constantly updated as data comes in.

@danielfromearth
Collaborator Author

Well, in the short term, without Icechunk stores in Earthdata, do folks think it would still be beneficial to have this tutorial notebook here in the earthaccess repository?

@TomNicholas, @ayushnag, do you think some wording changes — to avoid confusion regarding the meaning of 'virtual dataset' — would suffice for now?

Would folks rather this demonstration notebook be put in a place outside of earthaccess, for example in the Earthdata Cloud Cookbook, or in the ASDC Data and User Services page?

@battistowx
Collaborator

battistowx commented May 2, 2025

I think in the longer-term sense, it would probably be best if we had another similar notebook with icechunk and virtualizarr methods in the Cloud Cookbook, and this notebook could go in the earthaccess docs. We can elaborate much more on icechunk in the cloud cookbook notebook too.

battistowx previously approved these changes May 14, 2025
@danielfromearth
Collaborator Author

danielfromearth commented May 21, 2025

Alright, I've tightened up the text and presentation a bit more, and thanks @betolink for the help with the Markdown fix. The notebook now uses a smaller example of a week's worth of data so that the notebook cells run in 15–30 seconds each during the CI docs build, instead of multiple minutes each.

Reviewers: do you think it is ready for approval?

Collaborator

@battistowx battistowx left a comment

This is looking great!

@danielfromearth danielfromearth merged commit 7f6bc6e into main May 23, 2025
11 checks passed
@danielfromearth danielfromearth deleted the issue-903-tutorial-for-open_virtual_dataset-using-TEMPO-data branch May 23, 2025 13:55
Labels

impact: documentation (Improvements or additions to documentation)

Development

Successfully merging this pull request may close these issues.

Add tutorial notebook for open_virtual_dataset

6 participants