Description
What happened?
Overview
I think there is a bottleneck in the DatasetWrapper.dataset attribute, which is populated by calling xcdat.open_mfdataset(). The lines below attempt to open all .nc files in self.directory.
The issue with this approach is that Xarray/xCDAT will attempt to concatenate all of these files into a single Dataset object, which can be slow depending on the number of files, the file sizes, and the shape/structure of the data. That cost is paid even when only one variable is requested.
zppy-interfaces/zppy_interfaces/global_time_series/coupled_global_dataset_wrapper.py
Lines 17 to 20 in 460a87f
# `directory` will be of the form `{case_dir}/post/<component>/glb/ts/monthly/{ts_num_years_str}yr/`
self.dataset: xarray.core.dataset.Dataset = xcdat.open_mfdataset(
    f"{directory}*.nc", center_times=True
)
Example Code
The example script below (from #19) takes ~85 seconds to open all of the land dataset files, even though we only need the dataset for "FSH" (lines below).
Script used:
from zppy_interfaces.global_time_series.__main__ import main
import sys
import time
sys.argv.extend([
"--use_ocn", "False",
"--input", "/lcrc/group/e3sm2/ac.wlin/E3SMv3/v3.LR.historical_0051",
"--input_subdir", "archive/ocn/hist",
"--moc_file", "mocTimeSeries_1985-2014.nc",
"--case_dir", "/lcrc/group/e3sm/ac.forsyth2/E3SMv3_20250331_try2/v3.LR.historical_0051",
"--experiment_name", "v3.LR.historical_0051",
"--figstr", "v3.LR.historical_0051",
"--color", "Blue",
"--ts_num_years", "30",
"--plots_original", "None",
"--plots_atm", "TREFHT",
"--plots_ice", "None",
# "--plots_lnd", "all",
"--plots_lnd", "FSH",
"--plots_ocn", "None",
"--nrows", "4",
"--ncols", "2",
"--results_dir", "./zi",
"--regions", "glb,n,s",
"--make_viewer", "True",
"--start_yr", "1985",
"--end_yr", "2014"
])
start_time = time.time()
main()
end_time = time.time()
print(f"Execution time: {end_time - start_time:.2f} seconds")
Related code, just before the land datasets are opened through the call to set_var and then DatasetWrapper.dataset:
zppy-interfaces/zppy_interfaces/global_time_series/coupled_global.py
Lines 242 to 249 in 460a87f
requested_variables.vars_land = set_var(
    exp,
    "land",
    requested_variables.vars_land,
    valid_vars,
    invalid_vars,
    rgn,
)
The directory for the land dataset is directory = '/lcrc/group/e3sm/ac.forsyth2/E3SMv3_20250331_try2/v3.LR.historical_0051/post/lnd/glb/ts/monthly/30yr/', which contains 353 files totaling 1058.67 MB.
Example Code (Minimum with xCDAT)
Output
Number of .nc files: 353
Total size of .nc files: 1058.67 MB
Time taken to open dataset with all variables: 38.01 seconds
Time taken to open dataset with one variable: 0.61 seconds
Code
#%%
import xcdat as xc
import os
import time
directory = '/lcrc/group/e3sm/ac.forsyth2/E3SMv3_20250331_try2/v3.LR.historical_0051/post/lnd/glb/ts/monthly/30yr/'
# Count the number of .nc files in the directory and calculate their total size
# ---------------------------------------------------------------------------
nc_file_count = 0
total_size = 0
for f in os.listdir(directory):
    if f.endswith('.nc'):
        nc_file_count += 1
        total_size += os.path.getsize(os.path.join(directory, f))
print(f"Number of .nc files: {nc_file_count}")
print(f"Total size of .nc files: {total_size / (1024**2):.2f} MB")
# Measure time taken to open dataset with all variables and one variable
# ---------------------------------------------------------------------------
start_time = time.time()
ds_all_vars = xc.open_mfdataset(directory, center_times=True)
end_time = time.time()
print(f"Time taken to open dataset with all variables: {end_time - start_time:.2f} seconds")
start_time = time.time()
ds_one_var = xc.open_mfdataset(f"{directory}/FSH*.nc", center_times=True)
end_time = time.time()
print(f"Time taken to open dataset with one variable: {end_time - start_time:.2f} seconds")
Possible Solution
- Add a variables parameter to DatasetWrapper.__init__: the list of required variables to open with xCDAT
- Create a list of filepaths using directory and variables
- Pass filepaths to xcdat.open_mfdataset()
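A minimal sketch of the three steps above, not a tested patch against zppy-interfaces. The helper names build_filepaths and open_requested are hypothetical, and the `<VAR>_*.nc` filename pattern is an assumption about how the time series files are named (the timing script above only shows that `FSH*.nc` matches the FSH files). A prefix match like this can also pick up files for other variables whose names begin with the same prefix, so the pattern may need tightening to the real naming scheme.

```python
import glob
import os
from typing import Iterable, List


def build_filepaths(directory: str, variables: Iterable[str]) -> List[str]:
    """Collect the .nc files in `directory` that belong to the requested variables.

    ASSUMPTION: files are named `<VAR>_<something>.nc`; adjust the pattern
    if the actual naming scheme differs.
    """
    filepaths: List[str] = []
    for var in variables:
        # Escape both parts so special characters are matched literally.
        pattern = os.path.join(glob.escape(directory), f"{glob.escape(var)}_*.nc")
        filepaths.extend(glob.glob(pattern))
    # De-duplicate (a variable may be requested twice) and keep a stable order.
    return sorted(set(filepaths))


def open_requested(directory: str, variables: Iterable[str]):
    """Hypothetical replacement for the open_mfdataset call in DatasetWrapper.__init__."""
    # Imported lazily so build_filepaths can be used and tested without xCDAT.
    import xcdat

    filepaths = build_filepaths(directory, variables)
    if not filepaths:
        raise FileNotFoundError(
            f"No .nc files found in {directory} for variables {list(variables)}"
        )
    # open_mfdataset accepts a list of paths, so only these files are concatenated.
    return xcdat.open_mfdataset(filepaths, center_times=True)
```

With the land run above, open_requested(directory, ["FSH"]) would only touch the FSH files instead of all 353, which is the case that took 0.61 seconds rather than 38.01 seconds in the timing comparison.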
What machine were you running on?
Chrysalis
Environment
Latest main with dev env
What command did you run?
Provided already above.