Improve clouds computation #201
base: main
Conversation
Pull Request Overview
This PR improves the cloud optics graph functionality by adding the allow_rechunk parameter to the compute cloud functions and updating the example notebook to reflect improved memory handling and performance adjustments.
- Added allow_rechunk=True in the cloud optics function calls for better handling of large outputs (see the sketch after this list).
- Revised the example notebook to modify chunk sizes, include a markdown note about Dask version requirements, and update the Dask cluster setup.
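A minimal sketch of how such a flag typically reaches Dask, assuming it is forwarded through `xarray.apply_ufunc` via `dask_gufunc_kwargs` (the standard xarray mechanism); `_tau_from_lwp`, the variable names, and the formula are illustrative stand-ins, not the actual `pyrte_rrtmgp` internals:

```python
import xarray as xr


def _tau_from_lwp(lwp, rel):
    # Toy per-chunk kernel: tau ~ (3/2) * LWP / (rho_water * r_eff),
    # standing in for the real cloud optics computation.
    return 1.5 * lwp / (1000.0 * rel * 1e-6)


def compute_cloud_optics(ds: xr.Dataset) -> xr.DataArray:
    return xr.apply_ufunc(
        _tau_from_lwp,
        ds["lwp"],
        ds["rel"],
        dask="parallelized",
        output_dtypes=[ds["lwp"].dtype],
        # Without this, apply_ufunc refuses inputs whose core dimensions
        # are chunked; allowing a rechunk lets Dask consolidate them.
        dask_gufunc_kwargs={"allow_rechunk": True},
    )
```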
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| pyrte_rrtmgp/rrtmgp_cloud_optics.py | Added allow_rechunk parameter to compute_cloud_optics calls. |
| examples/dyamond_clouds/example.ipynb | Updated chunk sizes, added Dask version recommendation, and cluster configuration changes. |
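The diff carries the actual numbers; purely as an illustration of the kind of change involved, opening the input with explicit Dask chunks looks like this (the file name and chunk size below are hypothetical, not taken from the notebook):

```python
import xarray as xr

# Hypothetical path and chunk size; the real values live in the notebook diff.
ds = xr.open_dataset("dyamond_clouds.nc", chunks={"site": 20_000})
```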
Comments suppressed due to low confidence (2)
examples/dyamond_clouds/example.ipynb:79
- [nitpick] Consider clarifying this instruction by briefly noting what issues may arise with older versions of Dask. This can help users understand the necessity of the version requirement.
To avoid memory issues, please use Dask version 2025.3.0 or higher. A [fix](https://docs.dask.org/en/stable/changelog.html#v2025-3-0) for `apply_ufunc` was included in that release which solves the memory issues.
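One way a notebook could enforce this (not part of the PR, just a hedged illustration) is to fail fast on older Dask versions:

```python
from packaging.version import Version

import dask

# Guard against the apply_ufunc memory regression fixed in Dask 2025.3.0.
if Version(dask.__version__) < Version("2025.3.0"):
    raise RuntimeError(
        f"dask {dask.__version__} detected; upgrade to 2025.3.0 or newer "
        "to avoid memory issues in apply_ufunc-based cloud optics."
    )
```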
pyrte_rrtmgp/rrtmgp_cloud_optics.py:174
- Verify that adding the 'allow_rechunk' parameter does not introduce unintended behavior in parallel execution. If needed, update tests and documentation to reflect its expected impact.
allow_rechunk=True,
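A regression test along the lines the reviewer suggests might look like the sketch below; it is illustrative only, reusing the hypothetical `compute_cloud_optics` sketch from earlier rather than the real package function:

```python
import numpy as np
import xarray as xr


def test_compute_cloud_optics_accepts_chunked_input():
    ds = xr.Dataset(
        {
            "lwp": (("site", "layer"), np.random.rand(8, 4)),
            "rel": (("site", "layer"), 5.0 + np.random.rand(8, 4)),
        }
    ).chunk({"site": 2})
    tau = compute_cloud_optics(ds)  # hypothetical sketch defined earlier
    # The graph should build and compute without chunk-related errors.
    assert bool(tau.compute().notnull().all())
```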
Force-pushed from 49fdb7e to 8108b74
```
/Users/brendancol/miniconda3/envs/pyrte-312/lib/python3.12/site-packages/xarray/namedarray/core.py:264: UserWarning: Duplicate dimension names present: dimensions {'ncontact'} appear more than once in dims=('nf', 'ncontact', 'ncontact'). We do not yet support duplicate dimension names, but we do allow initial construction of the object. We recommend you rename the dims immediately to become distinct, as most xarray functionality is likely to fail silently if you do not. To rename the dimensions you will need to set the ``.dims`` attribute of each variable, e.g. var.dims=('x0', 'x1').
  self._dims = self._parse_dimensions(dims)
```
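The warning is incidental to the PR, but if one wanted to silence it, xarray's own suggestion is to give the repeated axis distinct names; a minimal sketch (the variable layout and new dimension names are assumptions):

```python
import xarray as xr


def dedupe_contact_dims(var: xr.Variable) -> xr.Variable:
    # Rebuild the variable with distinct names for the repeated
    # 'ncontact' axis, since xarray does not support duplicate dims.
    return xr.Variable(("nf", "ncontact_a", "ncontact_b"), var.data)
```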
@brendancol I was able to make the processing viable by using map_blocks at a higher level; doing the same in lower-level functions did not have much effect. Other factors that slow the process down are writing to disk and having HUGE outputs; aggregating them in the way they will be used helps with that.
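A minimal sketch of that higher-level `map_blocks` pattern, assuming `ds` is the chunked input Dataset and reusing the hypothetical `compute_cloud_optics` from above; the early `sum` stands in for whatever aggregation the results actually need:

```python
import xarray as xr


def process_chunk(chunk: xr.Dataset) -> xr.Dataset:
    # Run the whole pipeline on one in-memory chunk and aggregate
    # immediately, so the graph carries small reduced results instead
    # of a huge intermediate output.
    tau = compute_cloud_optics(chunk)
    return tau.sum("layer").to_dataset(name="tau_total")


result = ds.map_blocks(process_chunk).compute()
```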
@sehnem thanks for the update. I'll pull it and run locally, and get back with any comments.
…tion for 7 worker dask cluster and process_chunk refactor
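For context, a 7-worker local cluster of the kind this commit references might be configured like this (the worker count matches the commit message; per-worker threads and memory limit are assumptions, not taken from the notebook):

```python
from dask.distributed import Client, LocalCluster

# 7 workers per the commit; threads/memory settings are illustrative.
cluster = LocalCluster(n_workers=7, threads_per_worker=1, memory_limit="2GB")
client = Client(cluster)
```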
@sehnem @brendancol Should this PR get merged? How will we handle the VERY LARGE DATA required?
@RobertPincus I have no experience with large files in git repositories; they make clones heavy. But I once saw a repository where the big files were not cloned by default, and I think it is this feature. Probably Brendan knows better.
Included `allow_rechunk` in the compute cloud functions, as the output size gets very large and Dask 2025.3 included a fix for these cases. I was able to run the full examples on 16 GB of RAM, so it should not be an issue anymore.