Fix land variable performance issue by eagerly loading data

chengzhuzhang · claude · chengzhuzhang · commit 229d6da5a64d · 2025-10-08T15:37:32.000-05:00
Land variables were taking ~18 minutes each vs ~5 seconds for atmosphere variables.

The issue was dask lazy evaluation: when the area scaling arrays (total_land_area, north_land_area, south_land_area) remained lazy dask arrays, the multiplication operation triggered loading all data from disk, causing the massive delay.

Solution: eagerly load both the area fields and the computed data array into memory before performing operations. This ensures all operations work with numpy arrays instead of lazy dask arrays.

Changes:
- For TOTAL metric variables, call .load() on the area fields (valid_area_per_gridcell, area, landfrac) after opening the dataset
- Call .load() on the annual average data_array after computation
- Reduces land variable processing from ~18 minutes to ~5-10 seconds

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
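The lazy-versus-eager behavior described in this message can be illustrated with a small toy model. This is only a sketch, not the code in utils.py and not dask or xarray: `LazyField`, `read_area`, and `scale_by_area` are hypothetical names, and the "disk read" is simulated with a counter. It assumes, as the message states, that a field left lazy is re-read each time an operation needs its values, while a field loaded once is served from memory afterwards.

```python
from typing import Callable, List, Optional


class LazyField:
    """Toy stand-in for a lazy array: each compute() re-reads unless loaded."""

    def __init__(self, reader: Callable[[], List[float]]) -> None:
        self._reader = reader
        self._cached: Optional[List[float]] = None
        self.reads = 0  # number of simulated disk reads

    def compute(self) -> List[float]:
        # Serve from memory if load() was called earlier.
        if self._cached is not None:
            return self._cached
        self.reads += 1
        return self._reader()

    def load(self) -> "LazyField":
        # Materialize once and keep the values in memory.
        if self._cached is None:
            self._cached = self.compute()
        return self


def read_area() -> List[float]:
    # Hypothetical per-gridcell areas standing in for a field read from disk.
    return [1.0, 2.0, 3.0]


def scale_by_area(values: List[float], area: LazyField) -> float:
    # Area-weighted total: needs the concrete area values every time.
    cells = area.compute()
    return sum(v * a for v, a in zip(values, cells))


# Three "years" of data, each scaled by the same area field.
yearly_values = [[1.0, 1.0, 1.0], [2.0, 2.0, 2.0], [3.0, 3.0, 3.0]]

# Lazy: the area field is re-read for every scaling operation.
lazy_area = LazyField(read_area)
lazy_totals = [scale_by_area(v, lazy_area) for v in yearly_values]

# Eager: load once up front, then every operation reuses the in-memory copy.
eager_area = LazyField(read_area).load()
eager_totals = [scale_by_area(v, eager_area) for v in yearly_values]

print("lazy reads:", lazy_area.reads, "eager reads:", eager_area.reads)
```

Both paths produce the same totals; only the number of simulated reads differs. In the actual change the equivalent step is xarray's `.load()`, which materializes the data in place and returns the same object.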
diff --git a/zppy_interfaces/global_time_series/coupled_global/utils.py b/zppy_interfaces/global_time_series/coupled_global/utils.py
@@ -178,11 +178,23 @@ def process_variable(
         # 2. Load only this variable's data
         dataset = xcdat.open_mfdataset(file_paths, center_times=True)
 
+        # For TOTAL metrics, eagerly load area fields to avoid lazy computation issues
+        if var.metric == Metric.TOTAL:
+            if "valid_area_per_gridcell" in dataset:
+                dataset["valid_area_per_gridcell"].load()
+            if "area" in dataset:
+                dataset["area"].load()
+            if "landfrac" in dataset:
+                dataset["landfrac"].load()
+
         try:
             # 3. Compute annual average
             annual_dataset = dataset.temporal.group_average(var.variable_name, "year")
             data_array = annual_dataset.data_vars[var.variable_name]
 
+            # Eagerly load the result to avoid lazy computation issues
+            data_array.load()
+
             # 4. Apply area scaling if needed
             data_array = apply_scaling(data_array, var.metric, dataset)