Skip to content

Conversation

@chengzhuzhang
Copy link
Collaborator

@chengzhuzhang chengzhuzhang commented Oct 8, 2025

The pull request improves performance further based on the initial performance improvement work in #26. When testing 165 year historical runs, I found that the land variables were taking ~18 minutes each vs ~5 seconds for atmosphere variables. The issue was dask lazy evaluation - when area scaling arrays (total_land_area, north_land_area, south_land_area) remained as lazy dask arrays, the multiplication operation triggered loading all data from disk, causing the massive delay.

Solution: Eagerly load both area fields and computed data arrays into memory before performing operations. This ensures all operations work with numpy arrays instead of lazy dask arrays.

Changes:

  • For TOTAL metric variables, call .load() on area fields (valid_area_per_gridcell, area, landfrac) after opening dataset
  • Call .load() on annual average data_array after computation
  • Reduces land variable processing from ~18 minutes to ~5-10 seconds

🤖 Generated with Claude Code

Select one: This pull request is...

  • a bug fix: increment the patch version
  • a small improvement: increment the minor version
  • a new feature: increment the minor version
  • an incompatible (non-backwards compatible) API change: increment the major version

Small Change

  • To merge, I will use "Squash and merge". That is, this change should be a single commit.
  • Logic: I have visually inspected the entire pull request myself.
  • Pre-commit checks: All the pre-commits checks have passed.

Land variables were taking ~18 minutes each vs ~5 seconds for atmosphere
variables. The issue was dask lazy evaluation - when area scaling arrays
(total_land_area, north_land_area, south_land_area) remained as lazy dask
arrays, the multiplication operation triggered loading all data from disk,
causing the massive delay.

Solution: Eagerly load both area fields and computed data arrays into
memory before performing operations. This ensures all operations work with
numpy arrays instead of lazy dask arrays.

Changes:
- For TOTAL metric variables, call .load() on area fields
  (valid_area_per_gridcell, area, landfrac) after opening dataset
- Call .load() on annual average data_array after computation
- Reduces land variable processing from ~18 minutes to ~5-10 seconds

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
@chengzhuzhang chengzhuzhang marked this pull request as ready for review October 9, 2025 23:37
@chengzhuzhang chengzhuzhang requested a review from forsyth2 October 9, 2025 23:38
@forsyth2 forsyth2 added the semver: small improvement Small improvement (will increment patch version) label Oct 9, 2025
Copy link
Collaborator

@forsyth2 forsyth2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks reasonable from visual inspection, and I see the GitHub Actions are passing.

Since we're in a heavy testing period for the Unified release anyway, we can check the integration tests results on main after merging / in the next rc.

@chengzhuzhang
Copy link
Collaborator Author

Thanks @forsyth2 .

@chengzhuzhang chengzhuzhang merged commit 2554bd4 into main Oct 10, 2025
5 checks passed
@chengzhuzhang chengzhuzhang deleted the one-more-fix-land-performance branch October 10, 2025 16:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

semver: small improvement Small improvement (will increment patch version)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants