First invocation of open_dataset takes 3 seconds due to backend entrypoint discovery being slow #10178

Open
@rabernat

What happened?

The first time I open an Xarray dataset (any dataset), it takes around 3 seconds. Every subsequent call to open_dataset is much faster.

Here's an example and a profile file.

import xarray as xr

%%prun -D open_dataset.prof
xr.open_dataset("~/.cache/xarray_tutorial_data/69c68be1605878a6c8efdd34d85b4ca1-air_temperature.nc", engine="netcdf4")
# Alternatively (make sure the file is already downloaded first):
# xr.tutorial.open_dataset("air_temperature")

4938479 function calls (4808880 primitive calls) in 3.812 seconds

open_dataset.prof.zip
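
For anyone who wants to read the profile without a GUI viewer, here is a minimal sketch using the standard-library pstats module. The helper name summarize_profile is made up for illustration, and it assumes the zip has been extracted to open_dataset.prof (the file name passed to %%prun above).

```python
import io
import pstats


def summarize_profile(prof_path, limit=15):
    """Return a text report of the entries with the highest cumulative time."""
    buf = io.StringIO()
    stats = pstats.Stats(prof_path, stream=buf)
    # strip_dirs() shortens file paths; "cumulative" sorts by time spent in a
    # function including everything it calls, which is what exposes slow imports.
    stats.strip_dirs().sort_stats("cumulative").print_stats(limit)
    return buf.getvalue()
```

Calling print(summarize_profile("open_dataset.prof")) should list the call chain leading to the slow function near the top of the report.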

The bulk of the time is spent in this function

def backends_dict_from_pkg(
    entrypoints: list[EntryPoint],
) -> dict[str, type[BackendEntrypoint]]:
    backend_entrypoints = {}
    for entrypoint in entrypoints:
        name = entrypoint.name
        try:
            backend = entrypoint.load()
            backend_entrypoints[name] = backend
        except Exception as ex:
            warnings.warn(
                f"Engine {name!r} loading failed:\n{ex}", RuntimeWarning, stacklevel=2
            )
    return backend_entrypoints

Specifically, most of it is spent on the entrypoint.load() line.
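
To narrow down which backend is responsible, each entry point can be loaded and timed on its own. The following is a rough sketch rather than anything xarray provides: time_entrypoint_loads is a hypothetical helper, it assumes Python 3.10+ (for the group= keyword of importlib.metadata.entry_points), and it assumes backends are registered under the "xarray.backends" entry-point group.

```python
import time
from importlib.metadata import entry_points


def time_entrypoint_loads(entrypoints):
    """Return {name: seconds} for loading each entry point, slowest first."""
    timings = {}
    for ep in entrypoints:
        start = time.perf_counter()
        try:
            ep.load()
        except Exception:
            # A broken backend should not abort the measurement.
            pass
        timings[ep.name] = time.perf_counter() - start
    return dict(sorted(timings.items(), key=lambda kv: kv[1], reverse=True))


if __name__ == "__main__":
    # Assumption: xarray backends are registered under this group name.
    backends = entry_points(group="xarray.backends")
    for name, seconds in time_entrypoint_loads(backends).items():
        print(f"{name:15s} {seconds * 1000:8.1f} ms")
```

Run this in a fresh interpreter. Python caches imported modules, so a second run in the same process reports near-zero times, and a backend that shares a heavy dependency with one loaded earlier will look cheaper than it is.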

What did you expect to happen?

This is an unacceptable overhead for low-latency applications, e.g. a serverless application that needs to open a dataset quickly. I expect the load time to be on the order of milliseconds for data on disk.

Minimal Complete Verifiable Example

import xarray as xr
xr.tutorial.open_dataset("air_temperature")

MVCE confirmation

  • Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • Complete example — the example is self-contained, including all data and the text of any traceback.
  • Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • New issue — a search of GitHub Issues suggests this is not a duplicate.
  • Recent environment — the issue occurs with the latest version of xarray and its dependencies.

Relevant log output

Anything else we need to know?

No response

Environment

INSTALLED VERSIONS

commit: None
python: 3.12.4 | packaged by conda-forge | (main, Jun 17 2024, 10:23:07) [GCC 12.3.0]
python-bits: 64
OS: Linux
OS-release: 5.10.220-209.869.amzn2.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: C.UTF-8
LANG: C.UTF-8
LOCALE: ('C', 'UTF-8')
libhdf5: 1.14.3
libnetcdf: 4.9.2

xarray: 2025.3.0
pandas: 2.2.2
numpy: 1.26.4
scipy: 1.13.1
netCDF4: 1.6.5
pydap: installed
h5netcdf: 1.3.0
h5py: 3.11.0
zarr: 3.0.6
cftime: 1.6.4
nc_time_axis: 1.4.1
iris: None
bottleneck: 1.4.0
dask: 2024.6.2
distributed: 2024.6.2
matplotlib: 3.8.4
cartopy: 0.23.0
seaborn: 0.13.2
numbagg: 0.8.1
fsspec: 2024.6.0
cupy: None
pint: 0.23
sparse: 0.15.4
flox: 0.10.0
numpy_groupies: 0.11.1
setuptools: 70.1.0
pip: 24.0
conda: None
pytest: 8.2.2
mypy: None
IPython: 8.25.0
sphinx: None
