Closed
Labels
Python (Affects Python cuDF API), bug (Something isn't working), cudf.pandas (Issues specific to cudf.pandas)
Description
Describe the bug
Most of the TestArrowArray.test_unstack[...] cases in the pandas test suite fail when unstacking multiple MultiIndex levels through cudf.pandas. The root cause seems to be that cuDF's unstack builds the column MultiIndex in a different order than pandas does.
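For context, the level tuples that test_unstack passes to unstack can be enumerated with the same itertools expression the test uses (it is visible in the traceback below). This is a minimal stdlib-only sketch; `level_combinations` is a hypothetical helper name and is not part of pandas or cuDF.

```python
import itertools


def level_combinations(nlevels):
    """Return the level tuples test_unstack iterates over for nlevels levels.

    Mirrors the expression in pandas' test_unstack; the function name is
    illustrative only.
    """
    levels = list(range(nlevels))
    return list(
        itertools.chain.from_iterable(
            itertools.permutations(levels, i) for i in range(1, nlevels)
        )
    )


three = level_combinations(3)
# Keep only the tuples that unstack more than one level at once.
multi = [combo for combo in three if len(combo) > 1]
print(three)
print(multi)
```

With a two-level index the expression yields only single-level tuples, so multi-level tuples such as (1, 0), the case used in the reproducer, come only from the three-level parametrizations.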
Steps/Code to reproduce bug
bash python/cudf/cudf/pandas/scripts/run-pandas-tests.sh -q "tests/extension/test_arrow.py::TestArrowArray::test_unstack[int8-frame-index3]"
$ bash python/cudf/cudf/pandas/scripts/run-pandas-tests.sh -q "tests/extension/test_arrow.py::TestArrowArray::test_unstack[int8-frame-index3]" -vvv
Running Pandas tests for version 2.3.3
=============================================== test session starts ================================================
platform linux -- Python 3.13.9, pytest-8.4.2, pluggy-1.6.0 -- /home/coder/.conda/envs/rapids/bin/python
cachedir: .pytest_cache
benchmark: 5.1.0 (defaults: timer=time.perf_counter disable_gc=False min_rounds=5 min_time=0.000005 max_time=1.0 calibration_precision=10 warmup=False warmup_iterations=100000)
hypothesis profile 'ci' -> deadline=None, suppress_health_check=(HealthCheck.too_slow, HealthCheck.differing_executors)
rootdir: /home/coder/cudf/pandas-testing/pandas-tests
configfile: pyproject.toml
plugins: cases-3.9.1, cov-7.0.0, rerunfailures-16.1, benchmark-5.1.0, anyio-4.11.0, hypothesis-6.142.4, pytest_httpserver-1.1.3, xdist-3.8.0
collected 1 item                                                                                                   
tests/extension/test_arrow.py::TestArrowArray::test_unstack[int8-frame-index3] <- tests/extension/base/reshaping.py FAILED [100%]
===================================================== FAILURES =====================================================
__________________________________ TestArrowArray.test_unstack[int8-frame-index3] __________________________________
func = <function call_operator at 0x789d7853f880>
args = (<cudf.pandas.fast_slow_proxy._FunctionProxy object at 0x789d5afaedf0>, (   A                    B                  
0...   c  a     b     a     c
0  1     0     0  <NA>  1     0     0  <NA>
1  1  <NA>  <NA>     1  1  <NA>  <NA>     1), {})
kwargs = {}, disable_module_accelerator = <function disable_module_accelerator at 0x789d6c7368e0>, fast = False
slow_args = (<function assert_frame_equal at 0x789f8355d760>, (   A                    B                  
0  A           B       ...   c  a     b     a     c
0  1     0     0  <NA>  1     0     0  <NA>
1  1  <NA>  <NA>     1  1  <NA>  <NA>     1), {})
slow_kwargs = {}
    def _fast_slow_function_call(
        func: Callable,
        /,
        *args,
        **kwargs,
    ) -> Any:
        """
        Call `func` with all `args` and `kwargs` converted to their
        respective fast type. If that fails, call `func` with all
        `args` and `kwargs` converted to their slow type.
    
        Wrap the result in a fast-slow proxy if it is a type we know how
        to wrap.
        """
        from .module_accelerator import disable_module_accelerator
    
        fast = False
        try:
            with nvtx.annotate(
                "EXECUTE_FAST",
                color=_CUDF_PANDAS_NVTX_COLORS["EXECUTE_FAST"],
                domain="cudf_pandas",
            ):
>               fast_args, fast_kwargs = _fast_arg(args), _fast_arg(kwargs)
                                         ^^^^^^^^^^^^^^^
../../python/cudf/cudf/pandas/fast_slow_proxy.py:1011: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
../../python/cudf/cudf/pandas/fast_slow_proxy.py:1200: in _fast_arg
    return _transform_arg(arg, "_fsproxy_fast", seen)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
../../python/cudf/cudf/pandas/fast_slow_proxy.py:1123: in _transform_arg
    return tuple(
../../python/cudf/cudf/pandas/fast_slow_proxy.py:1124: in <genexpr>
    _transform_arg(a, attribute_name, seen) for a in arg
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
../../python/cudf/cudf/pandas/fast_slow_proxy.py:1123: in _transform_arg
    return tuple(
../../python/cudf/cudf/pandas/fast_slow_proxy.py:1124: in <genexpr>
    _transform_arg(a, attribute_name, seen) for a in arg
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
../../python/cudf/cudf/pandas/fast_slow_proxy.py:1079: in _transform_arg
    typ = getattr(arg, attribute_name)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
../../python/cudf/cudf/pandas/fast_slow_proxy.py:529: in _fsproxy_fast
    self._fsproxy_wrapped = self._fsproxy_slow_to_fast()
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
../../python/cudf/cudf/pandas/fast_slow_proxy.py:191: in _fsproxy_slow_to_fast
    return slow_to_fast(self._fsproxy_wrapped)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
../../python/cudf/cudf/utils/performance_tracking.py:52: in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
../../python/cudf/cudf/core/dataframe.py:8729: in from_pandas
    return DataFrame(obj, nan_as_null=nan_as_null)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
../../python/cudf/cudf/pandas/_wrappers/pandas.py:2246: in DataFrame_init_
    _original_DataFrame_init(self, data, index, columns, *args, **kwargs)
../../python/cudf/cudf/utils/performance_tracking.py:52: in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
../../python/cudf/cudf/core/dataframe.py:994: in __init__
    i: as_column(col_value.array, nan_as_null=nan_as_null)
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
arbitrary = array([1, 1], dtype=object), nan_as_null = False, dtype = None, length = None
    def as_column(
        arbitrary: Any,
        nan_as_null: bool | None = None,
        dtype: Dtype | None = None,
        length: int | None = None,
    ) -> ColumnBase:
        """Create a Column from an arbitrary object
    
        Parameters
        ----------
        arbitrary : object
            Object to construct the Column from. See *Notes*.
        nan_as_null : bool, optional, default None
            If None (default), treats NaN values in arbitrary as null if there is
            no mask passed along with it. If True, combines the mask and NaNs to
            form a new validity mask. If False, leaves NaN values as is.
            Only applies when arbitrary is not a cudf object
            (Index, Series, Column).
        dtype : optional
            Optionally typecast the constructed Column to the given
            dtype.
        length : int, optional
            If `arbitrary` is a scalar, broadcast into a Column of
            the given length.
    
        Returns
        -------
        A Column of the appropriate type and size.
    
        Notes
        -----
        Currently support inputs are:
    
        * ``Column``
        * ``Series``
        * ``Index``
        * Scalars (can be broadcasted to a specified `length`)
        * Objects exposing ``__cuda_array_interface__`` (e.g., numba device arrays)
        * Objects exposing ``__array_interface__``(e.g., numpy arrays)
        * pyarrow array
        * pandas.Categorical objects
        * range objects
        """
        if isinstance(arbitrary, (range, pd.RangeIndex, cudf.RangeIndex)):
            with acquire_spill_lock():
                column = ColumnBase.from_pylibcudf(
                    plc.filling.sequence(
                        len(arbitrary),
                        pa_scalar_to_plc_scalar(
                            pa.scalar(arbitrary.start, type=pa.int64())
                        ),
                        pa_scalar_to_plc_scalar(
                            pa.scalar(arbitrary.step, type=pa.int64())
                        ),
                    )
                )
            if cudf.get_option("default_integer_bitwidth") and dtype is None:
                dtype = np.dtype(
                    f"i{cudf.get_option('default_integer_bitwidth') // 8}"
                )
            if dtype is not None:
                return column.astype(dtype)
            return column
        elif isinstance(arbitrary, (ColumnBase, cudf.Series, cudf.Index)):
            # Ignoring nan_as_null per the docstring
            if isinstance(arbitrary, cudf.Series):
                arbitrary = arbitrary._column
            elif isinstance(arbitrary, cudf.Index):
                arbitrary = arbitrary._column
            if dtype is not None:
                return arbitrary.astype(dtype)
            return arbitrary
        elif hasattr(arbitrary, "__cuda_array_interface__"):
            column = ColumnBase.from_cuda_array_interface(arbitrary)
            if nan_as_null is not False:
                column = column.nans_to_nulls()
            if dtype is not None:
                column = column.astype(dtype)
            return column
        elif isinstance(arbitrary, (pa.Array, pa.ChunkedArray)):
            column = ColumnBase.from_arrow(arbitrary)
            if nan_as_null is not False:
                column = column.nans_to_nulls()
            if dtype is not None:
                column = column.astype(dtype)
            return column
        elif isinstance(
            arbitrary, (pd.Series, pd.Index, pd.api.extensions.ExtensionArray)
        ):
            if isinstance(arbitrary.dtype, (pd.SparseDtype, pd.PeriodDtype)):
                raise NotImplementedError(
                    f"cuDF does not yet support {type(arbitrary.dtype).__name__}"
                )
            elif (
                cudf.get_option("mode.pandas_compatible")
                and isinstance(arbitrary, (pd.DatetimeIndex, pd.TimedeltaIndex))
                and arbitrary.freq is not None
            ):
                raise NotImplementedError("freq is not implemented yet")
            elif isinstance(arbitrary.dtype, pd.IntervalDtype) and isinstance(
                arbitrary.dtype.subtype, pd.DatetimeTZDtype
            ):
                raise NotImplementedError(
                    "cuDF does not yet support Intervals with timezone-aware datetimes"
                )
            elif isinstance(
                arbitrary.dtype,
                (pd.CategoricalDtype, pd.IntervalDtype, pd.DatetimeTZDtype),
            ):
                if isinstance(arbitrary.dtype, pd.DatetimeTZDtype):
                    new_tz = get_compatible_timezone(arbitrary.dtype)
                    arbitrary = arbitrary.astype(new_tz)
                if isinstance(arbitrary.dtype, pd.CategoricalDtype):
                    if isinstance(
                        arbitrary.dtype.categories.dtype, pd.DatetimeTZDtype
                    ):
                        new_tz = get_compatible_timezone(
                            arbitrary.dtype.categories.dtype
                        )
                        new_cats = arbitrary.dtype.categories.astype(new_tz)
                        new_dtype = pd.CategoricalDtype(
                            categories=new_cats, ordered=arbitrary.dtype.ordered
                        )
                        arbitrary = arbitrary.astype(new_dtype)
                    elif (
                        isinstance(
                            arbitrary.dtype.categories.dtype, pd.IntervalDtype
                        )
                        and dtype is None
                    ):
                        # Conversion to arrow converts IntervalDtype to StructDtype
                        dtype = CategoricalDtype(
                            categories=arbitrary.dtype.categories,
                            ordered=arbitrary.dtype.ordered,
                        )
                result = as_column(
                    pa.array(arbitrary, from_pandas=True),
                    nan_as_null=nan_as_null,
                    dtype=dtype,
                    length=length,
                )
                if (
                    cudf.get_option("mode.pandas_compatible")
                    and isinstance(arbitrary.dtype, pd.CategoricalDtype)
                    and is_pandas_nullable_extension_dtype(
                        arbitrary.dtype.categories.dtype
                    )
                    and dtype is None
                ):
                    # Store pandas extension dtype directly in the column's dtype property
                    # TODO: Move this to near isinstance(arbitrary.dtype.categories.dtype, pd.IntervalDtype)
                    # check above, for which merge should be working fully with pandas nullable extension dtypes.
                    result = result._with_type_metadata(
                        CategoricalDtype(
                            categories=arbitrary.dtype.categories,
                            ordered=arbitrary.dtype.ordered,
                        )
                    )
                return result
            elif is_pandas_nullable_extension_dtype(arbitrary.dtype):
                if (
                    isinstance(arbitrary.dtype, pd.ArrowDtype)
                    and (arrow_type := arbitrary.dtype.pyarrow_dtype) is not None
                    and (
                        pa.types.is_date32(arrow_type)
                        or pa.types.is_binary(arrow_type)
                        or pa.types.is_dictionary(arrow_type)
                    )
                ):
                    raise NotImplementedError(
                        f"cuDF does not yet support {arbitrary.dtype}"
                    )
                if isinstance(arbitrary, (pd.Series, pd.Index)):
                    # pandas arrays define __arrow_array__ for better
                    # pyarrow.array conversion
                    arbitrary = arbitrary.array
                result = as_column(
                    pa.array(arbitrary, from_pandas=True),
                    nan_as_null=nan_as_null,
                    dtype=dtype,
                    length=length,
                )
                if cudf.get_option("mode.pandas_compatible"):
                    # Store pandas extension dtype directly in the column's dtype property
                    result = result._with_type_metadata(arbitrary.dtype)
                return result
            elif isinstance(
                arbitrary.dtype, pd.api.extensions.ExtensionDtype
            ) and not isinstance(arbitrary, NumpyExtensionArray):
                raise NotImplementedError(
                    "Custom pandas ExtensionDtypes are not supported"
                )
            elif arbitrary.dtype.kind in "fiubmM":
                # numpy dtype like
                if isinstance(arbitrary, NumpyExtensionArray):
                    arbitrary = np.array(arbitrary)
                arb_dtype = np.dtype(arbitrary.dtype)
                if arb_dtype.kind == "f" and arb_dtype.itemsize == 2:
                    raise TypeError("Unsupported type float16")
                elif arb_dtype.kind in "mM":
                    # not supported by cupy
                    arbitrary = np.asarray(arbitrary)
                else:
                    arbitrary = cp.asarray(arbitrary)
                return as_column(
                    arbitrary, nan_as_null=nan_as_null, dtype=dtype, length=length
                )
            elif arbitrary.dtype.kind == "O":
                pyarrow_array = None
                if isinstance(arbitrary, NumpyExtensionArray):
                    # infer_dtype does not handle NumpyExtensionArray
                    arbitrary = np.array(arbitrary, dtype=object)
                inferred_dtype = infer_dtype(
                    arbitrary,
                    skipna=(
                        not cudf.get_option("mode.pandas_compatible")
                        and nan_as_null is not False
                    ),
                )
                if inferred_dtype in ("mixed-integer", "mixed-integer-float"):
                    raise MixedTypeError("Cannot create column with mixed types")
                elif dtype is None and inferred_dtype not in (
                    "mixed",
                    "decimal",
                    "string",
                    "empty",
                    "boolean",
                ):
>                   raise TypeError(
                        f"Cannot convert a {inferred_dtype} of object type"
E                       TypeError: Cannot convert a integer of object type
../../python/cudf/cudf/core/column/column.py:2946: TypeError
During handling of the above exception, another exception occurred:
E   AssertionError: MultiIndex level [1] are different
    
    Attribute "names" are different
    [left]:  [0]
    [right]: [None]
All traceback entries are hidden. Pass `--full-trace` to see hidden and internal frames.
During handling of the above exception, another exception occurred:
self = <tests.extension.test_arrow.TestArrowArray object at 0x789364de0f00>
data = <ArrowExtensionArray>
[1, 0, 1, 0, 1]
Length: 5, dtype: int8[pyarrow]
index = MultiIndex([('A', 'a', 1),
            ('A', 'b', 0),
            ('A', 'a', 0),
            ('B', 'a', 0),
            ('B', 'c', 1)],
           )
obj = 'frame'
    @pytest.mark.parametrize(
        "index",
        [
            # Two levels, uniform.
            pd.MultiIndex.from_product(([["A", "B"], ["a", "b"]]), names=["a", "b"]),
            # non-uniform
            pd.MultiIndex.from_tuples([("A", "a"), ("A", "b"), ("B", "b")]),
            # three levels, non-uniform
            pd.MultiIndex.from_product([("A", "B"), ("a", "b", "c"), (0, 1, 2)]),
            pd.MultiIndex.from_tuples(
                [
                    ("A", "a", 1),
                    ("A", "b", 0),
                    ("A", "a", 0),
                    ("B", "a", 0),
                    ("B", "c", 1),
                ]
            ),
        ],
    )
    @pytest.mark.parametrize("obj", ["series", "frame"])
    def test_unstack(self, data, index, obj):
        data = data[: len(index)]
        if obj == "series":
            ser = pd.Series(data, index=index)
        else:
            ser = pd.DataFrame({"A": data, "B": data}, index=index)
    
        n = index.nlevels
        levels = list(range(n))
        # [0, 1, 2]
        # [(0,), (1,), (2,), (0, 1), (0, 2), (1, 0), (1, 2), (2, 0), (2, 1)]
        combinations = itertools.chain.from_iterable(
            itertools.permutations(levels, i) for i in range(1, n)
        )
    
        for level in combinations:
            result = ser.unstack(level=level)
            assert all(
                isinstance(result[col].array, type(data)) for col in result.columns
            )
    
            if obj == "series":
                # We should get the same result with to_frame+unstack+droplevel
                df = ser.to_frame()
    
                alt = df.unstack(level=level).droplevel(0, axis=1)
                tm.assert_frame_equal(result, alt)
    
            obj_ser = ser.astype(object)
    
            expected = obj_ser.unstack(level=level, fill_value=data.dtype.na_value)
            if obj == "series":
                assert (expected.dtypes == object).all()
    
            result = result.astype(object)
>           tm.assert_frame_equal(result, expected)
tests/extension/base/reshaping.py:332: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
../../python/cudf/cudf/pandas/fast_slow_proxy.py:721: in __call__
    result, _ = _fast_slow_function_call(
../../python/cudf/cudf/pandas/fast_slow_proxy.py:1064: in _fast_slow_function_call
    result = func(*slow_args, **slow_kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
../../python/cudf/cudf/pandas/fast_slow_proxy.py:27: in call_operator
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
pandas/_libs/testing.pyx:55: in pandas._libs.testing.assert_almost_equal
    ???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
>   ???
E   AssertionError: MultiIndex level [1] are different
E   
E   MultiIndex level [1] values are different (100.0 %)
E   [left]:  Index(['a', 'b', 'a', 'c', 'a', 'b', 'a', 'c'], dtype='object', name=1)
E   [right]: Index(['A', 'A', 'B', 'B', 'A', 'A', 'B', 'B'], dtype='object')
E   At positional index 0, first diff: a != A
pandas/_libs/testing.pyx:173: AssertionError
============================================= short test summary info ==============================================
FAILED tests/extension/test_arrow.py::TestArrowArray::test_unstack[int8-frame-index3] - AssertionError: MultiIndex level [1] are different
MultiIndex level [1] values are different (100.0 %)
[left]:  Index(['a', 'b', 'a', 'c', 'a', 'b', 'a', 'c'], dtype='object', name=1)
[right]: Index(['A', 'A', 'B', 'B', 'A', 'A', 'B', 'B'], dtype='object')
At positional index 0, first diff: a != A
================================================ 1 failed in 3.17s =================================================
I think the root cause comes down to differences in how unstack/pivot order the resulting columns. See this reproducer:
In [1]: import pandas as pd
   ...: import cudf
   ...: 
   ...: mi = pd.MultiIndex.from_product(
   ...:     [["A", "B"], ["a", "b", "c"], [0, 1, 2]],
   ...:     names=["L0", "L1", "L2"],
   ...: )
   ...: 
   ...: data = pd.Series(range(len(mi)), index=mi)
   ...: pdf = pd.DataFrame({"A": data, "B": data})
   ...: 
   ...: pd_res = pdf.unstack(level=(1, 0))
   ...: 
   ...: gdf = cudf.DataFrame(pdf)
   ...: gdf_res = gdf.unstack(level=(1, 0))
   ...: cudf_res = gdf_res.to_pandas()
   ...: 
   ...: print("=== pandas ===")
   ...: print(list(pd_res.columns))
   ...: print("=== cudf ===")
   ...: print(list(cudf_res.columns))
=== pandas ===
[('A', 'a', 'A'), ('A', 'b', 'A'), ('A', 'c', 'A'), ('A', 'a', 'B'), ('A', 'b', 'B'), ('A', 'c', 'B'), ('B', 'a', 'A'), ('B', 'b', 'A'), ('B', 'c', 'A'), ('B', 'a', 'B'), ('B', 'b', 'B'), ('B', 'c', 'B')]
=== cudf ===
[('A', 'a', 'A'), ('A', 'a', 'B'), ('A', 'b', 'A'), ('A', 'b', 'B'), ('A', 'c', 'A'), ('A', 'c', 'B'), ('B', 'a', 'A'), ('B', 'a', 'B'), ('B', 'b', 'A'), ('B', 'b', 'B'), ('B', 'c', 'A'), ('B', 'c', 'B')]
In [2]: pd_res
Out[2]: 
    A                    B                  
L1  a  b  c   a   b   c  a  b  c   a   b   c
L0  A  A  A   B   B   B  A  A  A   B   B   B
L2                                          
0   0  3  6   9  12  15  0  3  6   9  12  15
1   1  4  7  10  13  16  1  4  7  10  13  16
2   2  5  8  11  14  17  2  5  8  11  14  17
In [3]: gdf_res
Out[3]: 
    A                    B                  
L1  a      b      c      a      b      c    
L0  A   B  A   B  A   B  A   B  A   B  A   B
L2                                          
0   0   9  3  12  6  15  0   9  3  12  6  15
1   1  10  4  13  7  16  1  10  4  13  7  16
2   2  11  5  14  8  17  2  11  5  14  8  17
In [4]: cudf_res
Out[4]: 
    A                    B                  
L1  a      b      c      a      b      c    
L0  A   B  A   B  A   B  A   B  A   B  A   B
L2                                          
0   0   9  3  12  6  15  0   9  3  12  6  15
1   1  10  4  13  7  16  1  10  4  13  7  16
2   2  11  5  14  8  17  2  11  5  14  8  17
Expected behavior
cuDF's unstack should return columns in the same order as pandas. This mismatch is responsible for ~300 pandas test failures.
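The two column lists printed by the reproducer contain the same labels and differ only in which unstacked level varies fastest. The stdlib-only sketch below rebuilds both lists; the sort keys are inferred from the printed output alone, not from the pandas or cuDF implementations, and the variable names are illustrative.

```python
import itertools

# Labels from the reproducer: frame columns, then index levels L1 and L0.
frame_cols = ["A", "B"]
l1 = ["a", "b", "c"]
l0 = ["A", "B"]

# Column tuples are (frame column, L1 label, L0 label) because level=(1, 0).
all_cols = list(itertools.product(frame_cols, l1, l0))

# Order printed for pandas: within each frame column, L1 varies fastest.
pandas_order = sorted(all_cols, key=lambda t: (t[0], t[2], t[1]))

# Order printed for cuDF: within each frame column, L0 varies fastest.
cudf_order = sorted(all_cols, key=lambda t: (t[0], t[1], t[2]))

print(pandas_order[:4])
print(cudf_order[:4])
# Same set of labels, different order.
print(sorted(pandas_order) == sorted(cudf_order))
```

In the displayed frames the values under each label also agree (for example, ('A', 'a', 'B') holds 9, 10, 11 in both), so the difference appears to be ordering only. Reordering one result to the other's columns, for instance with `cudf_res.reindex(columns=pd_res.columns)` in the reproducer, would be one way to confirm this; that check is a suggestion and was not run here.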