use_zarr_fill_value_as_mask=True is ignored in open_zarr #10269

Open
@rabernat

What happened?

I have found an example where open_zarr appears to ignore the use_zarr_fill_value_as_mask=True argument.

What did you expect to happen?

Background: Zarr Arrays have a fill_value property. This is used to determine the value of any "uninitialized" part of an array, e.g. chunks that have never been written. Zarr itself has no concept of missing data or masked data. This parameter was optional in Zarr Python 2 and required in Zarr Python 3.
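The role of fill_value for never-written chunks can be illustrated with a toy model. This is only a sketch: ToyChunkedArray and its methods are hypothetical names invented for this example and are not part of Zarr's API or implementation.

```python
class ToyChunkedArray:
    """Toy 1-D chunked array (hypothetical; NOT Zarr's implementation)."""

    def __init__(self, length, chunk_size, fill_value):
        self.length = length
        self.chunk_size = chunk_size
        self.fill_value = fill_value
        self.chunks = {}  # chunk index -> list of stored values

    def write_chunk(self, index, values):
        if len(values) != self.chunk_size:
            raise ValueError("chunk has the wrong size")
        self.chunks[index] = list(values)

    def read(self):
        out = []
        n_chunks = -(-self.length // self.chunk_size)  # ceiling division
        for i in range(n_chunks):
            # A chunk that was never written is materialized from fill_value
            default = [self.fill_value] * self.chunk_size
            out.extend(self.chunks.get(i, default))
        return out[: self.length]


arr = ToyChunkedArray(length=4, chunk_size=2, fill_value=-99)
arr.write_chunk(0, [1, 2])  # the second chunk is left uninitialized
result = arr.read()
print(result)
```

Nothing in this model marks the trailing values as "missing": they are ordinary values that happen to equal fill_value, which is why any masking has to be layered on top by the reader.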

When we implemented Zarr 3 support, we also changed the behavior of Zarr fill values.

Before Zarr 3, the Zarr array fill_value was optional, and Xarray used this field analogously to NetCDF's special _FillValue attribute. The value of the underlying Zarr array fill_value was used to apply a mask to the array when decoding it.

After Zarr 3, the Zarr array fill_value became mandatory, and we could no longer keep that behavior because it would trigger automatic masking for every array. For example, an integer array with fill_value=0 (the default) would be coerced to float in order to apply the mask, turning every 0 into NaN.
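The coercion described above can be reproduced with plain NumPy. This is a minimal sketch of sentinel-based masking in general, not xarray's actual decoding code; mask_with_fill_value is a hypothetical helper name.

```python
import numpy as np


def mask_with_fill_value(data, fill_value):
    """Hypothetical helper: replace entries equal to fill_value with NaN."""
    data = np.asarray(data)
    # NaN cannot be stored in an integer array, so promote to float first
    as_float = data.astype("float64")
    as_float[data == fill_value] = np.nan
    return as_float


counts = np.array([0, 5, 0, 7], dtype="int64")
masked = mask_with_fill_value(counts, fill_value=0)
print(masked.dtype)
print(masked)
```

Every legitimate 0 in the integer array is lost to NaN, and the dtype changes as a side effect, which is the behavior that could not be kept as a default.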

Instead, we added an option, use_zarr_fill_value_as_mask, to toggle this behavior. When it is set to False, we just use a regular _FillValue attribute to store the sentinel value for the mask.

if self._use_zarr_fill_value_as_mask:
    # Setting this attribute triggers CF decoding for missing values
    # by interpreting Zarr's fill_value to mean the same as netCDF's _FillValue
    if zarr_array.fill_value is not None:
        attributes["_FillValue"] = zarr_array.fill_value
elif "_FillValue" in attributes:
    original_zarr_dtype = zarr_array.metadata.data_type
    attributes["_FillValue"] = FillValueCoder.decode(
        attributes["_FillValue"], original_zarr_dtype.value
    )
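Read in isolation, the branch above suggests which sentinel should end up in _FillValue for each setting of the flag. The following standalone sketch mirrors that logic under simplifying assumptions: resolve_fill_value_attr is a hypothetical function, plain dicts stand in for the attributes, and the FillValueCoder.decode step is omitted.

```python
def resolve_fill_value_attr(attrs, zarr_fill_value, use_zarr_fill_value_as_mask):
    """Hypothetical stand-in for the branch above; returns a new attrs dict."""
    attrs = dict(attrs)  # copy so the caller's dict is not modified
    if use_zarr_fill_value_as_mask:
        # The Zarr array's fill_value overrides any stored _FillValue attribute
        if zarr_fill_value is not None:
            attrs["_FillValue"] = zarr_fill_value
    # Otherwise a stored _FillValue attribute is kept as the sentinel
    # (the real code additionally decodes it via FillValueCoder)
    return attrs


stored_attrs = {"_FillValue": 1}
as_mask = resolve_fill_value_attr(stored_attrs, -99, True)
not_mask = resolve_fill_value_attr(stored_attrs, -99, False)
print(as_mask, not_mask)
```

Under these assumptions, the flag set to True would select -99 as the mask sentinel and the flag set to False would select 1, which matches the expectation stated in this report.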

In the example below, I expected that setting use_zarr_fill_value_as_mask=True would produce the "old" behavior, but instead it appears to be ignored.


A separate but related issue is that the actual Zarr fill_value no longer appears in the encoding anywhere. That feels like a problem.

Minimal Complete Verifiable Example

import zarr
import xarray as xr
import numpy as np


# create a dataset with a different _FillValue attr and fill_value encoding
ds = xr.DataArray([1, 2, 3], attrs={"_FillValue": 1}, dims="x").to_dataset(name="foo")
ds.foo.encoding = {"fill_value": -99}

# write to zarr
store = zarr.storage.MemoryStore()
ds.to_zarr(store, zarr_format=3, consolidated=False)

# check that the fill_value encoding was propagated
za = zarr.open(store, path="foo")
assert za.fill_value == -99

# now resize the Zarr array to create true missing data
za.resize((4,))

# open this back up in xarray

# scenario 1: no mask_and_scale at all
# with mask_and_scale=False, no mask should be applied to the data
ds1 = xr.open_zarr(store, consolidated=False, zarr_format=3, chunks=None, mask_and_scale=False)

# ✅ no mask has been applied
np.testing.assert_equal(ds1.foo.values, [1, 2, 3, -99])
# _FillValue still in attrs
assert ds1.foo.attrs['_FillValue'] == 1

# note that fill_value doesn't appear in encoding anywhere
assert "fill_value" not in ds1.foo.encoding

# scenario 2: default for Zarr 3 - mask_and_scale=True, use_zarr_fill_value_as_mask=False
ds2 = xr.open_zarr(store, consolidated=False, zarr_format=3, chunks=None, use_zarr_fill_value_as_mask=False)

# ✅ mask applied correctly
np.testing.assert_equal(ds2.foo.values, [np.nan, 2, 3, -99])
# _FillValue moved from attrs to encoding
assert ds2.foo.encoding['_FillValue'] == 1
assert "_FillValue" not in ds2.foo.attrs
# still no sign of -99
assert "fill_value" not in ds2.foo.encoding
assert "fill_value" not in ds2.foo.attrs

# scenario 3: use_zarr_fill_value_as_mask=True
ds3 = xr.open_zarr(store, consolidated=False, zarr_format=3, chunks=None, use_zarr_fill_value_as_mask=True) 
# ❌ mask not applied correctly
try:
    np.testing.assert_equal(ds2.foo.values, [1, 2, 3, np.nan])
except AssertionError:
    # the mask was not applied correctly
    print("got", ds2.foo.values)
    # _FillValue is in both encoding and attrs!
    print("encoding", ds2.foo.encoding)
    print("attrs", ds1.foo.attrs)

MVCE confirmation

  • Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • Complete example — the example is self-contained, including all data and the text of any traceback.
  • Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • New issue — a search of GitHub Issues suggests this is not a duplicate.
  • Recent environment — the issue occurs with the latest version of xarray and its dependencies.

Relevant log output

got [ nan   2.   3. -99.]
encoding {'chunks': (3,), 'preferred_chunks': {'x': 3}, 'compressors': (ZstdCodec(level=0, checksum=False),), 'filters': (), 'shards': None, 'serializer': BytesCodec(endian=<Endian.little: 'little'>), '_FillValue': 1, 'dtype': dtype('int64')}
attrs {'_FillValue': 1}

Anything else we need to know?

No response

Environment

INSTALLED VERSIONS
------------------
commit: None
python: 3.12.4 | packaged by conda-forge | (main, Jun 17 2024, 10:23:07) [GCC 12.3.0]
python-bits: 64
OS: Linux
OS-release: 5.10.220-209.869.amzn2.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: C.UTF-8
LANG: C.UTF-8
LOCALE: ('C', 'UTF-8')
libhdf5: 1.14.3
libnetcdf: 4.9.2

xarray: 2025.3.1
pandas: 2.2.2
numpy: 1.26.4
scipy: 1.13.1
netCDF4: 1.6.5
pydap: installed
h5netcdf: 1.3.0
h5py: 3.11.0
zarr: 3.0.7
cftime: 1.6.4
nc_time_axis: 1.4.1
iris: None
bottleneck: 1.4.0
dask: 2024.6.2
distributed: 2024.6.2
matplotlib: 3.8.4
cartopy: 0.23.0
seaborn: 0.13.2
numbagg: 0.8.1
fsspec: 2024.6.0
cupy: None
pint: 0.23
sparse: 0.15.4
flox: 0.9.9
numpy_groupies: 0.11.1
setuptools: 70.1.0
pip: 24.0
conda: None
pytest: 8.2.2
mypy: None
IPython: 8.25.0
sphinx: None

Labels: bug, needs triage