[Bug]: Potential performance bottleneck with DatasetWrapper.dataset #21

@tomvothecoder

Description

What happened?

Overview

I think there is a bottleneck with the DatasetWrapper.dataset attribute, which is set by calling xcdat.open_mfdataset(). The lines below attempt to open all .nc files in self.directory.

The issue with this approach is that Xarray/xCDAT will attempt to concatenate all of these files into a single Dataset object, which can be slow depending on the number of files, the file sizes, and the shape/structure of the data.

# `directory` will be of the form `{case_dir}/post/<component>/glb/ts/monthly/{ts_num_years_str}yr/`
self.dataset: xarray.core.dataset.Dataset = xcdat.open_mfdataset(
    f"{directory}*.nc", center_times=True
)

Example Code

The example script below (from #19) takes ~85 seconds to open all of the land dataset files, even though only the dataset for "FSH" is needed (see the lines below).

Script used:

from zppy_interfaces.global_time_series.__main__ import main
import sys
import time

sys.argv.extend([
    "--use_ocn", "False",
    "--input", "/lcrc/group/e3sm2/ac.wlin/E3SMv3/v3.LR.historical_0051",
    "--input_subdir", "archive/ocn/hist",
    "--moc_file", "mocTimeSeries_1985-2014.nc",
    "--case_dir", "/lcrc/group/e3sm/ac.forsyth2/E3SMv3_20250331_try2/v3.LR.historical_0051",
    "--experiment_name", "v3.LR.historical_0051",
    "--figstr", "v3.LR.historical_0051",
    "--color", "Blue",
    "--ts_num_years", "30",
    "--plots_original", "None",
    "--plots_atm", "TREFHT",
    "--plots_ice", "None",
    # "--plots_lnd", "all",
    "--plots_lnd", "FSH",
    "--plots_ocn", "None",
    "--nrows", "4",
    "--ncols", "2",
    "--results_dir", "./zi",
    "--regions", "glb,n,s",
    "--make_viewer", "True",
    "--start_yr", "1985",
    "--end_yr", "2014"
])

start_time = time.time()
main()
end_time = time.time()

print(f"Execution time: {end_time - start_time:.2f} seconds")

Related code that runs before the land datasets are opened, through the call to set_var and then DatasetWrapper.dataset:

requested_variables.vars_land = set_var(
    exp,
    "land",
    requested_variables.vars_land,
    valid_vars,
    invalid_vars,
    rgn,
)

The directory for the land dataset is directory = '/lcrc/group/e3sm/ac.forsyth2/E3SMv3_20250331_try2/v3.LR.historical_0051/post/lnd/glb/ts/monthly/30yr/', which contains 353 files totaling 1058.67 MB.

Example Code (Minimal, with xCDAT)

Output

Number of .nc files: 353
Total size of .nc files: 1058.67 MB

Time taken to open dataset with all variables: 38.01 seconds
Time taken to open dataset with one variable: 0.61 seconds

Code

#%%
import xcdat as xc
import os
import time

directory = '/lcrc/group/e3sm/ac.forsyth2/E3SMv3_20250331_try2/v3.LR.historical_0051/post/lnd/glb/ts/monthly/30yr/'

# Count the number of .nc files in the directory and calculate their total size
# ---------------------------------------------------------------------------
nc_file_count = 0
total_size = 0
for f in os.listdir(directory):
    if f.endswith('.nc'):
        nc_file_count += 1
        total_size += os.path.getsize(os.path.join(directory, f))

print(f"Number of .nc files: {nc_file_count}")
print(f"Total size of .nc files: {total_size / (1024**2):.2f} MB")

# Measure time taken to open dataset with all variables and one variable
# ---------------------------------------------------------------------------
start_time = time.time()
ds_all_vars = xc.open_mfdataset(directory, center_times=True)
end_time = time.time()
print(f"Time taken to open dataset with all variables: {end_time - start_time:.2f} seconds")

start_time = time.time()
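# `directory` already ends with "/", so join the pattern instead of adding another slash
ds_one_var = xc.open_mfdataset(os.path.join(directory, "FSH*.nc"), center_times=True)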
end_time = time.time()
print(f"Time taken to open dataset with one variable: {end_time - start_time:.2f} seconds")

Possible Solution

  • Add variables parameter to DatasetWrapper.__init__ -- list of required variables to open with xCDAT
  • Create a list of filepaths using directory and variables
  • Pass filepaths to xcdat.open_mfdataset()
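The three steps above could be sketched roughly as follows. This is only an illustration, not the actual zppy-interfaces implementation: `build_variable_filepaths` and `open_variables` are hypothetical helper names, and the sketch assumes the time-series files are named `<variable>_<YYYYMM>_<YYYYMM>.nc` (the minimal example above only relies on the looser `FSH*.nc` glob, which could also match other variables that start with `FSH`). The xcdat import is deferred so the path-building helper can be used on its own.

```python
import os
import re
from typing import List


def build_variable_filepaths(directory: str, variables: List[str]) -> List[str]:
    """Return paths of the .nc files in `directory` that belong to `variables`.

    ASSUMPTION (not confirmed by the issue): files are named
    `<variable>_<YYYYMM>_<YYYYMM>.nc`, e.g. `FSH_198501_201412.nc`.
    """
    filenames = os.listdir(directory)
    filepaths: List[str] = []
    for var in variables:
        # Exact match on the variable name so that "FSH" does not also pick
        # up files for a different variable that merely starts with "FSH".
        pattern = re.compile(rf"^{re.escape(var)}_\d{{6}}_\d{{6}}\.nc$")
        matches = sorted(f for f in filenames if pattern.match(f))
        if not matches:
            raise FileNotFoundError(
                f"No .nc files found for variable {var!r} in {directory}"
            )
        filepaths.extend(os.path.join(directory, f) for f in matches)
    return filepaths


def open_variables(directory: str, variables: List[str]):
    """Open only the files for `variables` instead of every file in `directory`."""
    import xcdat  # deferred import: only needed when a dataset is actually opened

    filepaths = build_variable_filepaths(directory, variables)
    return xcdat.open_mfdataset(filepaths, center_times=True)
```

With something like this, DatasetWrapper.__init__ could accept a `variables` list and call `open_variables(directory, variables)` in place of the `f"{directory}*.nc"` glob, so a run with `--plots_lnd FSH` would open only the FSH file(s) rather than all 353.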

What machine were you running on?

Chrysalis

Environment

Latest main with dev env

What command did you run?

Provided already above.

What stack trace are you encountering?

Metadata

    Labels

    priority: high - High priority task (for next release)
    semver: bug - Bug fix (will increment patch version)
