Description
What happened?
I have found an example where open_zarr appears to ignore the use_zarr_fill_value_as_mask=True argument.
What did you expect to happen?
Background: Zarr Arrays have a fill_value property. This is used to determine the value of any "uninitialized" part of an array, e.g. chunks that have never been written. Zarr itself has no concept of missing data or masked data. This parameter was optional in Zarr Python 2 and required in Zarr Python 3.
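As a rough illustration of what "uninitialized" means here, the following toy model stores only the chunks that were actually written and falls back to fill_value for everything else. It is not Zarr's implementation; the class ToyChunkedArray and its methods are made up for this sketch.

```python
class ToyChunkedArray:
    """Toy 1-D chunked array; hypothetical, only mimics the fill_value idea."""

    def __init__(self, length, chunk_size, fill_value):
        self.length = length
        self.chunk_size = chunk_size
        self.fill_value = fill_value
        self.chunks = {}  # chunk index -> list of values; absent = never written

    def write_chunk(self, index, values):
        if len(values) != self.chunk_size:
            raise ValueError("chunk has the wrong size")
        self.chunks[index] = list(values)

    def read(self):
        out = []
        n_chunks = -(-self.length // self.chunk_size)  # ceiling division
        for i in range(n_chunks):
            # Unwritten chunks are materialized from fill_value on read.
            out.extend(self.chunks.get(i, [self.fill_value] * self.chunk_size))
        return out[: self.length]


arr = ToyChunkedArray(length=4, chunk_size=2, fill_value=-99)
arr.write_chunk(0, [1, 2])
print(arr.read())  # [1, 2, -99, -99]
```

Note that the reader cannot tell a stored -99 from an unwritten one, which is why fill_value on its own carries no notion of missing data.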
When we implemented Zarr 3 support, we also changed the behavior of Zarr fill values.
Before Zarr 3, the Zarr array fill_value was optional, and Xarray used this field analogously to NetCDF's special _FillValue attribute. The value of the underlying Zarr array fill_value was used to apply a mask to the array when decoding it.
After Zarr 3, the Zarr array fill_value became mandatory, and we could no longer keep that behavior because it triggered automatic masking of data. For example, an integer array with fill_value=0 (the default) would be coerced to float in order to apply the mask, turning all 0s into NaNs.
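The coercion problem can be reproduced with NumPy alone. This is a minimal sketch of sentinel-based masking, assuming a hypothetical helper mask_with_sentinel; it is not the decoding code xarray actually runs.

```python
import numpy as np


def mask_with_sentinel(values, sentinel):
    """Replace entries equal to `sentinel` with NaN (hypothetical helper)."""
    values = np.asarray(values)
    # NaN cannot be stored in an integer array, so the data is promoted to float.
    as_float = values.astype("float64")
    as_float[values == sentinel] = np.nan
    return as_float


counts = np.array([0, 5, 0, 7], dtype="int64")  # the zeros are real data here
decoded = mask_with_sentinel(counts, sentinel=0)
print(decoded.dtype)  # float64
print(decoded)        # the two genuine zeros are now NaN
```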
Instead, we added an option, use_zarr_fill_value_as_mask, to toggle this behavior. When set to False, we just use a regular _FillValue attribute to store the sentinel value for the mask.
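Read literally, the toggle decides which value acts as the mask sentinel. The sketch below spells out that reading; pick_mask_sentinel is a hypothetical function written for this issue and does not correspond to xarray's real code path.

```python
def pick_mask_sentinel(zarr_fill_value, attrs, use_zarr_fill_value_as_mask):
    """Return the value that should be masked, or None for no masking.

    Hypothetical helper, not xarray's real code path.
    """
    if use_zarr_fill_value_as_mask:
        # "Old" behavior: the array-level fill_value doubles as the mask sentinel.
        return zarr_fill_value
    # "New" behavior: only an explicit _FillValue attribute defines the mask.
    return attrs.get("_FillValue")


attrs = {"_FillValue": 1}
print(pick_mask_sentinel(-99, attrs, use_zarr_fill_value_as_mask=True))   # -99
print(pick_mask_sentinel(-99, attrs, use_zarr_fill_value_as_mask=False))  # 1
print(pick_mask_sentinel(-99, {}, use_zarr_fill_value_as_mask=False))     # None
```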
xarray/xarray/backends/zarr.py
Lines 874 to 883 in 729c4fa
In the example below, I expected that setting use_zarr_fill_value_as_mask=True would produce the "old" behavior, but instead it appears to be ignored.
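To make that expectation concrete, here is a NumPy-only sketch of the decoded values I would expect in each scenario of the example below, where the stored data is [1, 2, 3, -99], the _FillValue attribute is 1, and the Zarr fill_value is -99. The decode helper is hypothetical and only stands in for the masking step.

```python
import numpy as np

raw = np.array([1, 2, 3, -99])  # stored values after the resize in the example


def decode(values, sentinel):
    """Mask `sentinel` as NaN; sentinel=None means no masking (hypothetical)."""
    if sentinel is None:
        return values
    return np.where(values == sentinel, np.nan, values.astype("float64"))


scenario_1 = decode(raw, None)  # mask_and_scale=False: data returned untouched
scenario_2 = decode(raw, 1)     # mask on the _FillValue attribute
scenario_3 = decode(raw, -99)   # expected when masking on the Zarr fill_value

np.testing.assert_equal(scenario_1, [1, 2, 3, -99])
np.testing.assert_equal(scenario_2, [np.nan, 2, 3, -99])
np.testing.assert_equal(scenario_3, [1, 2, 3, np.nan])
```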
A separate but related issue is that the actual Zarr fill_value no longer appears in the encoding anywhere. That feels like a problem.
Minimal Complete Verifiable Example
import zarr
import xarray as xr
import numpy as np
# create a dataset with a different _FillValue attr and fill_value encoding
ds = xr.DataArray([1, 2, 3], attrs={"_FillValue": 1}, dims="x").to_dataset(name="foo")
ds.foo.encoding = {"fill_value": -99}
# write to zarr
store = zarr.storage.MemoryStore()
ds.to_zarr(store, zarr_format=3, consolidated=False)
# check that the fill_value encoding was propagated
za = zarr.open(store, path="foo")
assert za.fill_value == -99
# now resize the Zarr array to create true missing data
za.resize((4,))
# open this back up in xarray
# scenario 1: no mask_and_scale at all
# use_zarr_fill_value_as_mask=False should not mask the data
ds1 = xr.open_zarr(store, consolidated=False, zarr_format=3, chunks=None, mask_and_scale=False)
# ✅ no mask has been applied
np.testing.assert_equal(ds1.foo.values, [1, 2, 3, -99])
# _FillValue still in attrs
assert ds1.foo.attrs['_FillValue'] == 1
# note that fill_value doesn't appear in encoding anywhere
assert "fill_value" not in ds1.foo.encoding
# scenario 2: default for Zarr 3 - mask_and_scale=True, use_zarr_fill_value_as_mask=False
ds2 = xr.open_zarr(store, consolidated=False, zarr_format=3, chunks=None, use_zarr_fill_value_as_mask=False)
# ✅ mask applied correctly
np.testing.assert_equal(ds2.foo.values, [np.nan, 2, 3, -99])
# _FillValue moved from attrs to encoding
assert ds2.foo.encoding['_FillValue'] == 1
assert "_FillValue" not in ds2.foo.attrs
# still no sign of -99
assert "fill_value" not in ds2.foo.encoding
assert "fill_value" not in ds2.foo.attrs
# scenario 3: use_zarr_fill_value_as_mask=True
ds3 = xr.open_zarr(store, consolidated=False, zarr_format=3, chunks=None, use_zarr_fill_value_as_mask=True)
# ❌ mask not applied correctly
try:
    np.testing.assert_equal(ds2.foo.values, [1, 2, 3, np.nan])
except AssertionError:
    # the mask was not applied correctly
    print("got", ds2.foo.values)
# _FillValue is in both encoding and attrs!
print("encoding", ds2.foo.encoding)
print("attrs", ds1.foo.attrs)
MVCE confirmation
- Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
- Complete example — the example is self-contained, including all data and the text of any traceback.
- Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
- New issue — a search of GitHub Issues suggests this is not a duplicate.
- Recent environment — the issue occurs with the latest version of xarray and its dependencies.
Relevant log output
got [ nan 2. 3. -99.]
encoding {'chunks': (3,), 'preferred_chunks': {'x': 3}, 'compressors': (ZstdCodec(level=0, checksum=False),), 'filters': (), 'shards': None, 'serializer': BytesCodec(endian=<Endian.little: 'little'>), '_FillValue': 1, 'dtype': dtype('int64')}
attrs {'_FillValue': 1}
Anything else we need to know?
No response
Environment
INSTALLED VERSIONS
------------------
commit: None
python: 3.12.4 | packaged by conda-forge | (main, Jun 17 2024, 10:23:07) [GCC 12.3.0]
python-bits: 64
OS: Linux
OS-release: 5.10.220-209.869.amzn2.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: C.UTF-8
LANG: C.UTF-8
LOCALE: ('C', 'UTF-8')
libhdf5: 1.14.3
libnetcdf: 4.9.2
xarray: 2025.3.1
pandas: 2.2.2
numpy: 1.26.4
scipy: 1.13.1
netCDF4: 1.6.5
pydap: installed
h5netcdf: 1.3.0
h5py: 3.11.0
zarr: 3.0.7
cftime: 1.6.4
nc_time_axis: 1.4.1
iris: None
bottleneck: 1.4.0
dask: 2024.6.2
distributed: 2024.6.2
matplotlib: 3.8.4
cartopy: 0.23.0
seaborn: 0.13.2
numbagg: 0.8.1
fsspec: 2024.6.0
cupy: None
pint: 0.23
sparse: 0.15.4
flox: 0.9.9
numpy_groupies: 0.11.1
setuptools: 70.1.0
pip: 24.0
conda: None
pytest: 8.2.2
mypy: None
IPython: 8.25.0
sphinx: None