Use of native sparse array support in xarray / pandas / netCDF #800

irm-codebase · 2024-07-06T12:39:20Z

irm-codebase
Jul 6, 2024
Collaborator

What can be improved?

Calliope should be more memory efficient! <- Finally this one is applicable

While checking how to better support sparsity in our ecosystem, I found about sparse. It is quite literally focused on the common use-case of super sparse data in ESOMs.

After some investigation, it seems like xarray either has, or is planning to, roll out support for this library: pydata/xarray#3213. pandas seems to have also rolled out support.

I propose to evaluate how to integrate this into calliope, with two key design goals in mind:

Cut out "fattening" that happens in herently in xarray setups.
Find a way to refer to the dim itself in our constraints for cases were they are ordered. Two use cases (for pathways):
- Being able to do "sum from year X to year Y"
- Being able to do use "year current - year installed", and then use it as a reference to get a time-sensitive parameter that depends on it (e.g., by passing it to an inner function like gamma, a linear efficiency decrease, etc).

Version

v0.7

brynpickering · 2024-07-06T13:21:25Z

brynpickering
Jul 6, 2024
Maintainer

We've investigated sparse arrays a few times in the past. The problem is the data types that can be handled sparsely. If it can now handle pyomo/gurobi objects then that's great!

0 replies

irm-codebase · 2024-07-06T13:56:53Z

irm-codebase
Jul 6, 2024
Collaborator Author

Can you elaborate on the incompatibility?
The linked thread has MESSAGEix and PyPSA modellers, so it might be possible through some method.
Here is another thread that goes in detail for xarray: pydata/xarray#1375

0 replies

brynpickering · 2024-07-09T15:36:12Z

brynpickering
Jul 9, 2024
Maintainer

I did some tests on this locally and it doesn't work out-of-the-box. Operations between sparse arrays don't work when the data types are objects, e.g.:

import calliope
import xarray

m = calliope.examples.urban_scale()
m.build(force=True)

# you can't transform an existing array into sparse, so this is a quick hack.
foo = xr.DataArray.from_series(m.backend.variables.flow_cap.to_series(), sparse=True)
bar = xr.DataArray.from_series(m.backend.parameters.flow_cap_max.to_series(), sparse=True)

foo * bar

[Out] TypeError: Implicit conversion of Pyomo numeric value (parameters[flow_cap_max][12]*variables[flow_cap][0]) to float is disabled.
This error is often the result of using Pyomo components as arguments to
one of the Python built-in math module functions when defining
expressions. Avoid this error by using Pyomo-provided math functions or
explicitly resolving the numeric value using the Pyomo value() function.

I've also tried setting the fill value of foo and bar to non-NaN vals and using np.multiply(foo, bar, dtype=np.object_). These all end up creating dtype issues in the scipy.sparse library, e.g.:

File ~/miniforge3/envs/calliope/lib/python3.12/site-packages/sparse/_umath.py:542, in _Elemwise._get_fill_value(self)
    540 # Store dtype separately if needed.
    541 if self.dtype is not None:
--> 542     fill_value = fill_value.astype(self.dtype)
    544 self.fill_value = fill_value
    545 self.dtype = self.fill_value.dtype

AttributeError: 'float' object has no attribute 'astype'

File ~/miniforge3/envs/calliope/lib/python3.12/site-packages/sparse/_umath.py:524, in _Elemwise._get_fill_value(self)
    521     fill_value_array = self.func(*np.broadcast_arrays(*zero_args), **self.kwargs)
    523 try:
--> 524     fill_value = fill_value_array[(0,) * fill_value_array.ndim]
    525 except IndexError:
    526     zero_args = tuple(
    527         arg.fill_value if isinstance(arg, COO) else _zero_of_dtype(arg.dtype) for arg in self.args
    528     )

AttributeError: 'float' object has no attribute 'ndim'

File ~/miniforge3/envs/calliope/lib/python3.12/site-packages/sparse/_umath.py:545, in _Elemwise._get_fill_value(self)
    542     fill_value = fill_value.astype(self.dtype)
    544 self.fill_value = fill_value
--> 545 self.dtype = self.fill_value.dtype

AttributeError: 'float' object has no attribute 'dtype'

It works fine for operations on purely numeric data, just not when working with object arrays.

0 replies

irm-codebase · 2024-07-19T08:46:00Z

irm-codebase
Jul 19, 2024
Collaborator Author

@brynpickering after messing around with the backend, I see what you mean... this one will be tough.

The big issue here is that np.nan consumes the same amount of memory as a float (64 bits), and we currently have nan for stuff like carriers (i.e., adding a new carrier will add nans for all cases with that dim). So, adding a single heat technology to an electricity only model may bloat some parts by twice the amount of memory.

Pyomo does support sparsity natively... but I do not yet know the backend well enough to know how/what needs to change.

Algorithmically, a solution should be possible if we have full determinism when flattening matrixes. i.e., we always know the following:

the maximum number of dims combinations.
the order in which dims will be added in relation to each other.
the order in which elements in each dim will be added.

Not sure if we currently have this, but it should allow us to lazily fill in sparse vectors and then drop them to the backend. At least in theory...

0 replies

brynpickering · 2024-07-19T10:21:13Z

brynpickering
Jul 19, 2024
Maintainer

We're already effectively using dense arrays as far as Pyomo is concerned. It's the application of operations across N dimensions (incl. broadcasting capabilities) that we benefit from by also representing those objects in NaN-filled arrays. We came from this full determinism in v0.6 to what we have now because it is very messy to ensure this in a generalised way, especially when math components don't share the exact same dimensions.

0 replies

irm-codebase · 2024-07-19T12:25:18Z

irm-codebase
Jul 19, 2024
Collaborator Author

Yep, I understand that the issue is not in the backend. The increase in memory will only happen on our side...
As far as a generalized algorithm... would something like this work?
I've used something like this for similar problems.

import itertools

dims = ["techs", "nodes", "steps", "carriers", "foo", "bar", "perrito"]
all_unique_combinations_sorted = set()
for i in range(len(dims)):
    for group in itertools.combinations(dims, i+1):
        all_unique_combinations_sorted.add(".".join(sorted(group)).lower())
all_unique_combinations_sorted.add(["GLOBALS"])  # no idea if this is even needed

This contains all possible combinations of our dimensions, always in order (127 + GLOBALS in this example).
Create a dict with them, and store a sparse xarray in it. Add flattened vector data to it.

Lookups are easy: just sort and join the dimensions you want. Similarly, the large number of keys does not matter too much. Sparse data has little memory impact by design, and you can easily erase empty combinations if you wish.

Would this work, or am I saying something silly?

0 replies

brynpickering · 2024-09-26T08:51:18Z

brynpickering
Sep 26, 2024
Maintainer

I'm closing this as not an issue that we plan to address, given how small our datasets are (even with all their NaNs) compared to peak memory consumption in the optimisation step (see this comment).

0 replies

irm-codebase · 2024-09-30T13:46:51Z

irm-codebase
Sep 30, 2024
Collaborator Author

@brynpickering I'm ok in closing this.

Some closing remarks: my worry actually relates to the size of the optimisation problem.
If we have a NaN parameter in the input data, there are two things that could happen:

The backend still has to go through it when constructing the model.
Depending on our checks, this NaN might trigger the generation of 'unused' constraints and variables, which will bloat the end model without affecting results.

However, I should've done more checks on this, so I will only re-open this if I identify it as a problem that reaches the actual optimisation.

1 reply

brynpickering Aug 21, 2025
Maintainer

The backend doesn't go through the nans when constructing the model. The xarray arrays are just dimensioned views on the underlying pyomo/gurobi objects. Those objects are stored in completely dense arrays within the respective instances.

brynpickering · 2025-08-21T10:58:13Z

brynpickering
Aug 21, 2025
Maintainer

@irm-codebase I've moved this to a discussion so we can slowly chip away at it. One thing I've found I can do to circumvent the way sparse handles object dtypes in its arithmetic operations is to create a fill value object that always remains an object no matter what sparse tries to throw at it:

import calliope
import xarray as xr
import numpy as np

class FillObj:
    def __init__(self, value):
        self.value = float(value)
    
    @property
    def dtype(self):
        return np.dtype('O')  # Object dtype
    
    def __add__(self, other):
        return FillObj(self.value)
    
    def __sub__(self, other):
        return FillObj(self.value)
    
    def __mul__(self, other):
        return FillObj(self.value)
    
    def __truediv__(self, other):
        return FillObj(self.value)
    
    def __floordiv__(self, other):
        return FillObj(self.value)
    
    def __mod__(self, other):
        return FillObj(self.value)
    
    def __pow__(self, other):
        return FillObj(self.value)
    
    # Reverse operations
    def __radd__(self, other):
        return FillObj(self.value)
    
    def __rsub__(self, other):
        return FillObj(self.value)
    
    def __rmul__(self, other):
        return FillObj(self.value)
    
    def __rtruediv__(self, other):
        return FillObj(self.value)
    
    def __rfloordiv__(self, other):
        return FillObj(self.value)
    
    def __rmod__(self, other):
        return FillObj(self.value)
    
    def __rpow__(self, other):
        return FillObj(self.value)
    
    def __repr__(self):
        return f"<{self.value}>"
    
    def __float__(self):
        return self.value


m = calliope.examples.national_scale(time_subset=None)  # full year
m.build()

# you can't transform an existing array into sparse, so this is a quick hack.
foo = xr.DataArray.from_series(m.backend.variables.flow_cap.to_series(), sparse=True)
bar = xr.DataArray.from_series(m.backend.parameters.flow_out.to_series(), sparse=True)

for sparse_arr in [foo, bar]:
    sparse_arr.data.fill_value = FillObj(np.nan)
    
sparse_da = foo * bar
dense_da = m.backend.variables.flow_cap * m.backend.parameters.flow_out

The problem I'm having is that the size of the resulting sparse array is larger than the dense one!

def quick_memory_comparison(dense_da, sparse_da):
    """Quick one-liner memory comparison"""
    dense_mb = dense_da.nbytes / (1024**2)
    sparse_mb = sparse_da.data.nbytes / (1024**2) if hasattr(sparse_da.data, 'nbytes') else \
                (sum(coord.nbytes for coord in sparse_da.data.coords) + sparse_da.data.data.nbytes) / (1024**2)
    
    print(f"Dense: {dense_mb:.2f} MB | Sparse: {sparse_mb:.2f} MB | Ratio: {dense_mb/sparse_mb:.2f}x")
    return dense_mb, sparse_mb
    
quick_memory_comparison(dense_da, sparse_da)

[Out] Dense: 2.67 MB | Sparse: 5.68 MB | Ratio: 0.47x

3 replies

irm-codebase Aug 21, 2025
Collaborator Author

I am not sure what the method above would do exactly since you are cheating the array from our existing setup...

As far as I understand, sparse arrays must be defined as such from the start, and then 'filled'.
My idea for sparse arrays was to define the input data directly using COO and the schema setup we have, and then filling in the user-given data.

Assume some parameter with default=5. Then:

import xarray as xr
import sparse
import numpy as np

# Define your dimensions
dims = ("techs", "nodes", "timesteps")
shape = (3, 2, 4)  # 3 techs, 2 nodes, 4 timesteps

coords = np.empty((len(dims), 0), dtype=int)  # ndim * non-zero values (here: 3 × 0) because it starts empty
data = np.array([], dtype=int)

s = sparse.COO(
    coords=coords,
    data=data,
    shape=shape,
    fill_value=5  # assume this parameter has a default of 5
)

da = xr.DataArray(
    s,
    dims=dims,
    coords={
        "techs": [f"tech{i}" for i in range(shape[0])],  # numbers here, but they could just be the names of the techs
        "nodes": [f"node{i}" for i in range(shape[1])],  # ditto
        "timesteps": np.arange(shape[2]),  # ditto
    },
    name="parameter"
)

Keep in mind: I'm not an expert in using sparse! but this should work according to their docs

irm-codebase Aug 21, 2025
Collaborator Author

One last bit I just learned about @brynpickering : netCDF is incompatible with sparse arrays!
So, if implemented, we'd have to fully move to using zarr.

irm-codebase Aug 21, 2025
Collaborator Author

Also, I am not sure if this would really work for variables... these are just defined for the backend with a given dimensionality.

Use of native sparse array support in xarray / pandas / netCDF #800

Uh oh!

Uh oh!

irm-codebase Jul 6, 2024 Collaborator

What can be improved?

Version

Replies: 9 comments · 4 replies

Uh oh!

brynpickering Jul 6, 2024 Maintainer

Uh oh!

Uh oh!

irm-codebase Jul 6, 2024 Collaborator Author

Uh oh!

brynpickering Jul 9, 2024 Maintainer

Uh oh!

Uh oh!

irm-codebase Jul 19, 2024 Collaborator Author

Uh oh!

brynpickering Jul 19, 2024 Maintainer

Uh oh!

Uh oh!

irm-codebase Jul 19, 2024 Collaborator Author

Uh oh!

brynpickering Sep 26, 2024 Maintainer

Uh oh!

irm-codebase Sep 30, 2024 Collaborator Author

Uh oh!

brynpickering Aug 21, 2025 Maintainer

Uh oh!

brynpickering Aug 21, 2025 Maintainer

Uh oh!

Uh oh!

irm-codebase Aug 21, 2025 Collaborator Author

Uh oh!

irm-codebase Aug 21, 2025 Collaborator Author

Uh oh!

irm-codebase Aug 21, 2025 Collaborator Author

irm-codebase
Jul 6, 2024
Collaborator

Replies: 9 comments 4 replies

brynpickering
Jul 6, 2024
Maintainer

irm-codebase
Jul 6, 2024
Collaborator Author

brynpickering
Jul 9, 2024
Maintainer

irm-codebase
Jul 19, 2024
Collaborator Author

brynpickering
Jul 19, 2024
Maintainer

irm-codebase
Jul 19, 2024
Collaborator Author

brynpickering
Sep 26, 2024
Maintainer

irm-codebase
Sep 30, 2024
Collaborator Author

brynpickering Aug 21, 2025
Maintainer

brynpickering
Aug 21, 2025
Maintainer

irm-codebase Aug 21, 2025
Collaborator Author

irm-codebase Aug 21, 2025
Collaborator Author

irm-codebase Aug 21, 2025
Collaborator Author