Description
What happened?
The first time I open an Xarray dataset (any dataset), it takes around 3 seconds. Any subsequent call to open_dataset is much faster.
Here's an example and a profile file.
import xarray as xr
# Run the following in its own notebook cell; %%prun must be the first line of the cell.
%%prun -D open_dataset.prof
xr.open_dataset("~/.cache/xarray_tutorial_data/69c68be1605878a6c8efdd34d85b4ca1-air_temperature.nc", engine="netcdf4")
# alternatively, but make sure it's already downloaded
# xr.tutorial.open_dataset("air_temperature")
4938479 function calls (4808880 primitive calls) in 3.812 seconds
The bulk of the time is spent in this function:
xarray/xarray/backends/plugins.py, lines 66 to 79 in 66f6c17
And specifically, the entrypoint.load() line.
What did you expect to happen?
This overhead is unacceptable for low-latency applications, e.g. a serverless application that needs to open a dataset quickly. I expect the load time to be on the order of milliseconds for data on disk.
Minimal Complete Verifiable Example
import xarray as xr
xr.tutorial.open_dataset("air_temperature")
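To reproduce the first-call versus later-call difference without IPython, the call can be timed with the standard library. This is a rough sketch under stated assumptions: `time_repeated_calls` is a hypothetical helper, and the commented usage assumes xarray is installed and the tutorial file is already in the local cache.

```python
import time


def time_repeated_calls(fn, repeats=2):
    """Call fn() `repeats` times and return each call's wall-clock duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations


# Example usage (run in a fresh interpreter so the first call is a cold one):
# import xarray as xr
# cold, warm = time_repeated_calls(lambda: xr.tutorial.open_dataset("air_temperature"))
# print(f"first call: {cold:.3f} s, second call: {warm:.3f} s")
```

Timing `import xarray` separately shows how much of the cold-start cost comes from the import itself rather than from the first `open_dataset` call.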
MVCE confirmation
- Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
- Complete example — the example is self-contained, including all data and the text of any traceback.
- Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
- New issue — a search of GitHub Issues suggests this is not a duplicate.
- Recent environment — the issue occurs with the latest version of xarray and its dependencies.
Relevant log output
Anything else we need to know?
No response
Environment
INSTALLED VERSIONS
commit: None
python: 3.12.4 | packaged by conda-forge | (main, Jun 17 2024, 10:23:07) [GCC 12.3.0]
python-bits: 64
OS: Linux
OS-release: 5.10.220-209.869.amzn2.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: C.UTF-8
LANG: C.UTF-8
LOCALE: ('C', 'UTF-8')
libhdf5: 1.14.3
libnetcdf: 4.9.2
xarray: 2025.3.0
pandas: 2.2.2
numpy: 1.26.4
scipy: 1.13.1
netCDF4: 1.6.5
pydap: installed
h5netcdf: 1.3.0
h5py: 3.11.0
zarr: 3.0.6
cftime: 1.6.4
nc_time_axis: 1.4.1
iris: None
bottleneck: 1.4.0
dask: 2024.6.2
distributed: 2024.6.2
matplotlib: 3.8.4
cartopy: 0.23.0
seaborn: 0.13.2
numbagg: 0.8.1
fsspec: 2024.6.0
cupy: None
pint: 0.23
sparse: 0.15.4
flox: 0.10.0
numpy_groupies: 0.11.1
setuptools: 70.1.0
pip: 24.0
conda: None
pytest: 8.2.2
mypy: None
IPython: 8.25.0
sphinx: None