[BUG] cudf.unstack produces incorrect MultiIndex column order #20446

@Matt711

Describe the bug
Most of the TestArrowArray.test_unstack[...] cases in the pandas test suite fail when unstacking multiple MultiIndex levels through cudf.pandas. The root cause seems to be that cuDF's unstack builds the column MultiIndex in a different order than pandas does.

Steps/Code to reproduce bug

bash python/cudf/cudf/pandas/scripts/run-pandas-tests.sh -q "tests/extension/test_arrow.py::TestArrowArray::test_unstack[int8-frame-index3]"
$ bash python/cudf/cudf/pandas/scripts/run-pandas-tests.sh -q "tests/extension/test_arrow.py::TestArrowArray::test_unstack[int8-frame-index3]" -vvv
Running Pandas tests for version 2.3.3
=============================================== test session starts ================================================
platform linux -- Python 3.13.9, pytest-8.4.2, pluggy-1.6.0 -- /home/coder/.conda/envs/rapids/bin/python
cachedir: .pytest_cache
benchmark: 5.1.0 (defaults: timer=time.perf_counter disable_gc=False min_rounds=5 min_time=0.000005 max_time=1.0 calibration_precision=10 warmup=False warmup_iterations=100000)
hypothesis profile 'ci' -> deadline=None, suppress_health_check=(HealthCheck.too_slow, HealthCheck.differing_executors)
rootdir: /home/coder/cudf/pandas-testing/pandas-tests
configfile: pyproject.toml
plugins: cases-3.9.1, cov-7.0.0, rerunfailures-16.1, benchmark-5.1.0, anyio-4.11.0, hypothesis-6.142.4, pytest_httpserver-1.1.3, xdist-3.8.0
collected 1 item                                                                                                   

tests/extension/test_arrow.py::TestArrowArray::test_unstack[int8-frame-index3] <- tests/extension/base/reshaping.py FAILED [100%]

===================================================== FAILURES =====================================================
__________________________________ TestArrowArray.test_unstack[int8-frame-index3] __________________________________

func = <function call_operator at 0x789d7853f880>
args = (<cudf.pandas.fast_slow_proxy._FunctionProxy object at 0x789d5afaedf0>, (   A                    B                  
0...   c  a     b     a     c
0  1     0     0  <NA>  1     0     0  <NA>
1  1  <NA>  <NA>     1  1  <NA>  <NA>     1), {})
kwargs = {}, disable_module_accelerator = <function disable_module_accelerator at 0x789d6c7368e0>, fast = False
slow_args = (<function assert_frame_equal at 0x789f8355d760>, (   A                    B                  
0  A           B       ...   c  a     b     a     c
0  1     0     0  <NA>  1     0     0  <NA>
1  1  <NA>  <NA>     1  1  <NA>  <NA>     1), {})
slow_kwargs = {}

    def _fast_slow_function_call(
        func: Callable,
        /,
        *args,
        **kwargs,
    ) -> Any:
        """
        Call `func` with all `args` and `kwargs` converted to their
        respective fast type. If that fails, call `func` with all
        `args` and `kwargs` converted to their slow type.
    
        Wrap the result in a fast-slow proxy if it is a type we know how
        to wrap.
        """
        from .module_accelerator import disable_module_accelerator
    
        fast = False
        try:
            with nvtx.annotate(
                "EXECUTE_FAST",
                color=_CUDF_PANDAS_NVTX_COLORS["EXECUTE_FAST"],
                domain="cudf_pandas",
            ):
>               fast_args, fast_kwargs = _fast_arg(args), _fast_arg(kwargs)
                                         ^^^^^^^^^^^^^^^

../../python/cudf/cudf/pandas/fast_slow_proxy.py:1011: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
../../python/cudf/cudf/pandas/fast_slow_proxy.py:1200: in _fast_arg
    return _transform_arg(arg, "_fsproxy_fast", seen)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
../../python/cudf/cudf/pandas/fast_slow_proxy.py:1123: in _transform_arg
    return tuple(
../../python/cudf/cudf/pandas/fast_slow_proxy.py:1124: in <genexpr>
    _transform_arg(a, attribute_name, seen) for a in arg
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
../../python/cudf/cudf/pandas/fast_slow_proxy.py:1123: in _transform_arg
    return tuple(
../../python/cudf/cudf/pandas/fast_slow_proxy.py:1124: in <genexpr>
    _transform_arg(a, attribute_name, seen) for a in arg
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
../../python/cudf/cudf/pandas/fast_slow_proxy.py:1079: in _transform_arg
    typ = getattr(arg, attribute_name)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
../../python/cudf/cudf/pandas/fast_slow_proxy.py:529: in _fsproxy_fast
    self._fsproxy_wrapped = self._fsproxy_slow_to_fast()
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
../../python/cudf/cudf/pandas/fast_slow_proxy.py:191: in _fsproxy_slow_to_fast
    return slow_to_fast(self._fsproxy_wrapped)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
../../python/cudf/cudf/utils/performance_tracking.py:52: in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
../../python/cudf/cudf/core/dataframe.py:8729: in from_pandas
    return DataFrame(obj, nan_as_null=nan_as_null)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
../../python/cudf/cudf/pandas/_wrappers/pandas.py:2246: in DataFrame_init_
    _original_DataFrame_init(self, data, index, columns, *args, **kwargs)
../../python/cudf/cudf/utils/performance_tracking.py:52: in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
../../python/cudf/cudf/core/dataframe.py:994: in __init__
    i: as_column(col_value.array, nan_as_null=nan_as_null)
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

arbitrary = array([1, 1], dtype=object), nan_as_null = False, dtype = None, length = None

    def as_column(
        arbitrary: Any,
        nan_as_null: bool | None = None,
        dtype: Dtype | None = None,
        length: int | None = None,
    ) -> ColumnBase:
        """Create a Column from an arbitrary object
    
        Parameters
        ----------
        arbitrary : object
            Object to construct the Column from. See *Notes*.
        nan_as_null : bool, optional, default None
            If None (default), treats NaN values in arbitrary as null if there is
            no mask passed along with it. If True, combines the mask and NaNs to
            form a new validity mask. If False, leaves NaN values as is.
            Only applies when arbitrary is not a cudf object
            (Index, Series, Column).
        dtype : optional
            Optionally typecast the constructed Column to the given
            dtype.
        length : int, optional
            If `arbitrary` is a scalar, broadcast into a Column of
            the given length.
    
        Returns
        -------
        A Column of the appropriate type and size.
    
        Notes
        -----
        Currently support inputs are:
    
        * ``Column``
        * ``Series``
        * ``Index``
        * Scalars (can be broadcasted to a specified `length`)
        * Objects exposing ``__cuda_array_interface__`` (e.g., numba device arrays)
        * Objects exposing ``__array_interface__``(e.g., numpy arrays)
        * pyarrow array
        * pandas.Categorical objects
        * range objects
        """
        if isinstance(arbitrary, (range, pd.RangeIndex, cudf.RangeIndex)):
            with acquire_spill_lock():
                column = ColumnBase.from_pylibcudf(
                    plc.filling.sequence(
                        len(arbitrary),
                        pa_scalar_to_plc_scalar(
                            pa.scalar(arbitrary.start, type=pa.int64())
                        ),
                        pa_scalar_to_plc_scalar(
                            pa.scalar(arbitrary.step, type=pa.int64())
                        ),
                    )
                )
            if cudf.get_option("default_integer_bitwidth") and dtype is None:
                dtype = np.dtype(
                    f"i{cudf.get_option('default_integer_bitwidth') // 8}"
                )
            if dtype is not None:
                return column.astype(dtype)
            return column
        elif isinstance(arbitrary, (ColumnBase, cudf.Series, cudf.Index)):
            # Ignoring nan_as_null per the docstring
            if isinstance(arbitrary, cudf.Series):
                arbitrary = arbitrary._column
            elif isinstance(arbitrary, cudf.Index):
                arbitrary = arbitrary._column
            if dtype is not None:
                return arbitrary.astype(dtype)
            return arbitrary
        elif hasattr(arbitrary, "__cuda_array_interface__"):
            column = ColumnBase.from_cuda_array_interface(arbitrary)
            if nan_as_null is not False:
                column = column.nans_to_nulls()
            if dtype is not None:
                column = column.astype(dtype)
            return column
        elif isinstance(arbitrary, (pa.Array, pa.ChunkedArray)):
            column = ColumnBase.from_arrow(arbitrary)
            if nan_as_null is not False:
                column = column.nans_to_nulls()
            if dtype is not None:
                column = column.astype(dtype)
            return column
        elif isinstance(
            arbitrary, (pd.Series, pd.Index, pd.api.extensions.ExtensionArray)
        ):
            if isinstance(arbitrary.dtype, (pd.SparseDtype, pd.PeriodDtype)):
                raise NotImplementedError(
                    f"cuDF does not yet support {type(arbitrary.dtype).__name__}"
                )
            elif (
                cudf.get_option("mode.pandas_compatible")
                and isinstance(arbitrary, (pd.DatetimeIndex, pd.TimedeltaIndex))
                and arbitrary.freq is not None
            ):
                raise NotImplementedError("freq is not implemented yet")
            elif isinstance(arbitrary.dtype, pd.IntervalDtype) and isinstance(
                arbitrary.dtype.subtype, pd.DatetimeTZDtype
            ):
                raise NotImplementedError(
                    "cuDF does not yet support Intervals with timezone-aware datetimes"
                )
            elif isinstance(
                arbitrary.dtype,
                (pd.CategoricalDtype, pd.IntervalDtype, pd.DatetimeTZDtype),
            ):
                if isinstance(arbitrary.dtype, pd.DatetimeTZDtype):
                    new_tz = get_compatible_timezone(arbitrary.dtype)
                    arbitrary = arbitrary.astype(new_tz)
                if isinstance(arbitrary.dtype, pd.CategoricalDtype):
                    if isinstance(
                        arbitrary.dtype.categories.dtype, pd.DatetimeTZDtype
                    ):
                        new_tz = get_compatible_timezone(
                            arbitrary.dtype.categories.dtype
                        )
                        new_cats = arbitrary.dtype.categories.astype(new_tz)
                        new_dtype = pd.CategoricalDtype(
                            categories=new_cats, ordered=arbitrary.dtype.ordered
                        )
                        arbitrary = arbitrary.astype(new_dtype)
                    elif (
                        isinstance(
                            arbitrary.dtype.categories.dtype, pd.IntervalDtype
                        )
                        and dtype is None
                    ):
                        # Conversion to arrow converts IntervalDtype to StructDtype
                        dtype = CategoricalDtype(
                            categories=arbitrary.dtype.categories,
                            ordered=arbitrary.dtype.ordered,
                        )
                result = as_column(
                    pa.array(arbitrary, from_pandas=True),
                    nan_as_null=nan_as_null,
                    dtype=dtype,
                    length=length,
                )
                if (
                    cudf.get_option("mode.pandas_compatible")
                    and isinstance(arbitrary.dtype, pd.CategoricalDtype)
                    and is_pandas_nullable_extension_dtype(
                        arbitrary.dtype.categories.dtype
                    )
                    and dtype is None
                ):
                    # Store pandas extension dtype directly in the column's dtype property
                    # TODO: Move this to near isinstance(arbitrary.dtype.categories.dtype, pd.IntervalDtype)
                    # check above, for which merge should be working fully with pandas nullable extension dtypes.
                    result = result._with_type_metadata(
                        CategoricalDtype(
                            categories=arbitrary.dtype.categories,
                            ordered=arbitrary.dtype.ordered,
                        )
                    )
                return result
            elif is_pandas_nullable_extension_dtype(arbitrary.dtype):
                if (
                    isinstance(arbitrary.dtype, pd.ArrowDtype)
                    and (arrow_type := arbitrary.dtype.pyarrow_dtype) is not None
                    and (
                        pa.types.is_date32(arrow_type)
                        or pa.types.is_binary(arrow_type)
                        or pa.types.is_dictionary(arrow_type)
                    )
                ):
                    raise NotImplementedError(
                        f"cuDF does not yet support {arbitrary.dtype}"
                    )
                if isinstance(arbitrary, (pd.Series, pd.Index)):
                    # pandas arrays define __arrow_array__ for better
                    # pyarrow.array conversion
                    arbitrary = arbitrary.array
                result = as_column(
                    pa.array(arbitrary, from_pandas=True),
                    nan_as_null=nan_as_null,
                    dtype=dtype,
                    length=length,
                )
                if cudf.get_option("mode.pandas_compatible"):
                    # Store pandas extension dtype directly in the column's dtype property
                    result = result._with_type_metadata(arbitrary.dtype)
                return result
            elif isinstance(
                arbitrary.dtype, pd.api.extensions.ExtensionDtype
            ) and not isinstance(arbitrary, NumpyExtensionArray):
                raise NotImplementedError(
                    "Custom pandas ExtensionDtypes are not supported"
                )
            elif arbitrary.dtype.kind in "fiubmM":
                # numpy dtype like
                if isinstance(arbitrary, NumpyExtensionArray):
                    arbitrary = np.array(arbitrary)
                arb_dtype = np.dtype(arbitrary.dtype)
                if arb_dtype.kind == "f" and arb_dtype.itemsize == 2:
                    raise TypeError("Unsupported type float16")
                elif arb_dtype.kind in "mM":
                    # not supported by cupy
                    arbitrary = np.asarray(arbitrary)
                else:
                    arbitrary = cp.asarray(arbitrary)
                return as_column(
                    arbitrary, nan_as_null=nan_as_null, dtype=dtype, length=length
                )
            elif arbitrary.dtype.kind == "O":
                pyarrow_array = None
                if isinstance(arbitrary, NumpyExtensionArray):
                    # infer_dtype does not handle NumpyExtensionArray
                    arbitrary = np.array(arbitrary, dtype=object)
                inferred_dtype = infer_dtype(
                    arbitrary,
                    skipna=(
                        not cudf.get_option("mode.pandas_compatible")
                        and nan_as_null is not False
                    ),
                )
                if inferred_dtype in ("mixed-integer", "mixed-integer-float"):
                    raise MixedTypeError("Cannot create column with mixed types")
                elif dtype is None and inferred_dtype not in (
                    "mixed",
                    "decimal",
                    "string",
                    "empty",
                    "boolean",
                ):
>                   raise TypeError(
                        f"Cannot convert a {inferred_dtype} of object type"
E                       TypeError: Cannot convert a integer of object type

../../python/cudf/cudf/core/column/column.py:2946: TypeError

During handling of the above exception, another exception occurred:

E   AssertionError: MultiIndex level [1] are different
    
    Attribute "names" are different
    [left]:  [0]
    [right]: [None]
All traceback entries are hidden. Pass `--full-trace` to see hidden and internal frames.

During handling of the above exception, another exception occurred:

self = <tests.extension.test_arrow.TestArrowArray object at 0x789364de0f00>
data = <ArrowExtensionArray>
[1, 0, 1, 0, 1]
Length: 5, dtype: int8[pyarrow]
index = MultiIndex([('A', 'a', 1),
            ('A', 'b', 0),
            ('A', 'a', 0),
            ('B', 'a', 0),
            ('B', 'c', 1)],
           )
obj = 'frame'

    @pytest.mark.parametrize(
        "index",
        [
            # Two levels, uniform.
            pd.MultiIndex.from_product(([["A", "B"], ["a", "b"]]), names=["a", "b"]),
            # non-uniform
            pd.MultiIndex.from_tuples([("A", "a"), ("A", "b"), ("B", "b")]),
            # three levels, non-uniform
            pd.MultiIndex.from_product([("A", "B"), ("a", "b", "c"), (0, 1, 2)]),
            pd.MultiIndex.from_tuples(
                [
                    ("A", "a", 1),
                    ("A", "b", 0),
                    ("A", "a", 0),
                    ("B", "a", 0),
                    ("B", "c", 1),
                ]
            ),
        ],
    )
    @pytest.mark.parametrize("obj", ["series", "frame"])
    def test_unstack(self, data, index, obj):
        data = data[: len(index)]
        if obj == "series":
            ser = pd.Series(data, index=index)
        else:
            ser = pd.DataFrame({"A": data, "B": data}, index=index)
    
        n = index.nlevels
        levels = list(range(n))
        # [0, 1, 2]
        # [(0,), (1,), (2,), (0, 1), (0, 2), (1, 0), (1, 2), (2, 0), (2, 1)]
        combinations = itertools.chain.from_iterable(
            itertools.permutations(levels, i) for i in range(1, n)
        )
    
        for level in combinations:
            result = ser.unstack(level=level)
            assert all(
                isinstance(result[col].array, type(data)) for col in result.columns
            )
    
            if obj == "series":
                # We should get the same result with to_frame+unstack+droplevel
                df = ser.to_frame()
    
                alt = df.unstack(level=level).droplevel(0, axis=1)
                tm.assert_frame_equal(result, alt)
    
            obj_ser = ser.astype(object)
    
            expected = obj_ser.unstack(level=level, fill_value=data.dtype.na_value)
            if obj == "series":
                assert (expected.dtypes == object).all()
    
            result = result.astype(object)
>           tm.assert_frame_equal(result, expected)

tests/extension/base/reshaping.py:332: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
../../python/cudf/cudf/pandas/fast_slow_proxy.py:721: in __call__
    result, _ = _fast_slow_function_call(
../../python/cudf/cudf/pandas/fast_slow_proxy.py:1064: in _fast_slow_function_call
    result = func(*slow_args, **slow_kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
../../python/cudf/cudf/pandas/fast_slow_proxy.py:27: in call_operator
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
pandas/_libs/testing.pyx:55: in pandas._libs.testing.assert_almost_equal
    ???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

>   ???
E   AssertionError: MultiIndex level [1] are different
E   
E   MultiIndex level [1] values are different (100.0 %)
E   [left]:  Index(['a', 'b', 'a', 'c', 'a', 'b', 'a', 'c'], dtype='object', name=1)
E   [right]: Index(['A', 'A', 'B', 'B', 'A', 'A', 'B', 'B'], dtype='object')
E   At positional index 0, first diff: a != A

pandas/_libs/testing.pyx:173: AssertionError
============================================= short test summary info ==============================================
FAILED tests/extension/test_arrow.py::TestArrowArray::test_unstack[int8-frame-index3] - AssertionError: MultiIndex level [1] are different

MultiIndex level [1] values are different (100.0 %)
[left]:  Index(['a', 'b', 'a', 'c', 'a', 'b', 'a', 'c'], dtype='object', name=1)
[right]: Index(['A', 'A', 'B', 'B', 'A', 'A', 'B', 'B'], dtype='object')
At positional index 0, first diff: a != A
================================================ 1 failed in 3.17s =================================================
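
For context on which `level` arguments the test exercises, the sketch below re-runs only the `itertools` expression from the test body for a three-level index. The value `n = 3` is an assumption matching the three-level parametrizations; nothing here touches pandas or cuDF.

```python
import itertools

n = 3  # number of index levels, as in the three-level parametrizations
levels = list(range(n))

# Same expression as in the test body: every ordering of 1..n-1 levels.
combinations = list(
    itertools.chain.from_iterable(
        itertools.permutations(levels, i) for i in range(1, n)
    )
)
print(combinations)
```

The list includes reversed pairs such as `(1, 0)`, which is the kind of multi-level unstack where the column order appears to diverge.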

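I think the root cause comes down to a difference in how cuDF and pandas implement unstack/pivot: for the same input, the two libraries return the same column labels in a different order. See this reproducer: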

In [1]: import pandas as pd
   ...: import cudf
   ...: 
   ...: mi = pd.MultiIndex.from_product(
   ...:     [["A", "B"], ["a", "b", "c"], [0, 1, 2]],
   ...:     names=["L0", "L1", "L2"],
   ...: )
   ...: 
   ...: data = pd.Series(range(len(mi)), index=mi)
   ...: pdf = pd.DataFrame({"A": data, "B": data})
   ...: 
   ...: pd_res = pdf.unstack(level=(1, 0))
   ...: 
   ...: gdf = cudf.DataFrame(pdf)
   ...: gdf_res = gdf.unstack(level=(1, 0))
   ...: cudf_res = gdf_res.to_pandas()
   ...: 
   ...: print("=== pandas ===")
   ...: print(list(pd_res.columns))
   ...: print("=== cudf ===")
   ...: print(list(cudf_res.columns))
=== pandas ===
[('A', 'a', 'A'), ('A', 'b', 'A'), ('A', 'c', 'A'), ('A', 'a', 'B'), ('A', 'b', 'B'), ('A', 'c', 'B'), ('B', 'a', 'A'), ('B', 'b', 'A'), ('B', 'c', 'A'), ('B', 'a', 'B'), ('B', 'b', 'B'), ('B', 'c', 'B')]
=== cudf ===
[('A', 'a', 'A'), ('A', 'a', 'B'), ('A', 'b', 'A'), ('A', 'b', 'B'), ('A', 'c', 'A'), ('A', 'c', 'B'), ('B', 'a', 'A'), ('B', 'a', 'B'), ('B', 'b', 'A'), ('B', 'b', 'B'), ('B', 'c', 'A'), ('B', 'c', 'B')]

In [2]: pd_res
Out[2]: 
    A                    B                  
L1  a  b  c   a   b   c  a  b  c   a   b   c
L0  A  A  A   B   B   B  A  A  A   B   B   B
L2                                          
0   0  3  6   9  12  15  0  3  6   9  12  15
1   1  4  7  10  13  16  1  4  7  10  13  16
2   2  5  8  11  14  17  2  5  8  11  14  17

In [3]: gdf_res
Out[3]: 
    A                    B                  
L1  a      b      c      a      b      c    
L0  A   B  A   B  A   B  A   B  A   B  A   B
L2                                          
0   0   9  3  12  6  15  0   9  3  12  6  15
1   1  10  4  13  7  16  1  10  4  13  7  16
2   2  11  5  14  8  17  2  11  5  14  8  17

In [4]: cudf_res
Out[4]: 
    A                    B                  
L1  a      b      c      a      b      c    
L0  A   B  A   B  A   B  A   B  A   B  A   B
L2                                          
0   0   9  3  12  6  15  0   9  3  12  6  15
1   1  10  4  13  7  16  1  10  4  13  7  16
2   2  11  5  14  8  17  2  11  5  14  8  17
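
The two column orders printed above can be reproduced without either library, which may help when reasoning about a fix. This is a minimal sketch based only on the output shown. It assumes the pandas order corresponds to the first appearance of each (L1, L0) pair in the original index and the cuDF order corresponds to a lexicographic sort of those pairs. The names `first_seen` and `lexicographic` are hypothetical, and neither ordering is confirmed here to be how the libraries compute the result internally.

```python
import itertools

# Rebuild the index tuples from the reproducer: product of L0, L1, L2.
index = list(itertools.product(["A", "B"], ["a", "b", "c"], [0, 1, 2]))

# Unstacking level=(1, 0) moves (L1, L0) pairs into the columns.
pairs = [(l1, l0) for (l0, l1, _l2) in index]

# Assumed ordering 1: keep pairs in order of first appearance in the index.
first_seen = list(dict.fromkeys(pairs))

# Assumed ordering 2: sort the unique pairs lexicographically.
lexicographic = sorted(set(pairs))

print("first appearance:", first_seen)
print("lexicographic:   ", lexicographic)
```

Under these assumptions, `first_seen` lines up with the (L1, L0) part of the pandas column tuples and `lexicographic` lines up with the cuDF ones.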

Expected behavior
cuDF's unstack should produce the same column MultiIndex order as pandas. This mismatch is responsible for ~300 pandas test failures.

    Labels

    Python (Affects Python cuDF API), bug (Something isn't working), cudf.pandas (Issues specific to cudf.pandas)

    Projects

    Status

    Done
