BUG: Arrow Binary View Types Don't Print When containing missing values #59883

Open
3 tasks done
WillAyd opened this issue Sep 24, 2024 · 4 comments
Labels
Arrow pyarrow functionality Bug Needs Discussion Requires discussion from core team before further action Strings String extension data type and string data

Comments

@WillAyd
Member

WillAyd commented Sep 24, 2024

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

Note that this example produces output:


In [10]: import pandas as pd
    ...: import pyarrow as pa
    ...: 
    ...: ser = pd.Series(["foo", "longer_than_binary_prefix", ''], dtype=pd.ArrowDtype(pa.string_view()))

In [11]: ser
Out[11]: 
0                          foo
1    longer_than_binary_prefix
2                             
dtype: string_view[pyarrow]

While this does not:

In [12]: import pandas as pd
    ...: import pyarrow as pa
    ...: 
    ...: ser = pd.Series(["foo", "longer_than_binary_prefix", None], dtype=pd.ArrowDtype(pa.string_view()))

In [13]: ser
Out[13]: 

This might actually be an upstream bug with pyarrow (@jorisvandenbossche typically knows best)



Issue Description

Values are not printing

Expected Behavior

Values should print

Installed Versions

In [14]: pa.__version__
Out[14]: '17.0.0'

In [15]: pd.__version__
Out[15]: '2.2.3+44.g3dfa33cf2d'
@WillAyd WillAyd added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Sep 24, 2024
@jorisvandenbossche
Member

Quickly checking, if I call to_string() explicitly, it does error:

In [12]: ser.to_string()
...
File ~/scipy/repos/pandas/pandas/core/arrays/arrow/array.py:1458, in ArrowExtensionArray.to_numpy(self, dtype, copy, na_value)
   1456     mask = data.isna()
   1457     result[mask] = na_value
-> 1458     result[~mask] = data[~mask]._pa_array.to_numpy()
   1459 return result

File ~/scipy/repos/pandas/pandas/core/arrays/arrow/array.py:591, in ArrowExtensionArray.__getitem__(self, item)
    589     return self.take(item)
    590 elif item.dtype.kind == "b":
--> 591     return type(self)(self._pa_array.filter(item))
    592 else:
    593     raise IndexError(
    594         "Only integers, slices and integer or "
    595         "boolean arrays are valid indices."
    596     )

File ~/conda/envs/dev/lib/python3.11/site-packages/pyarrow/table.pxi:959, in pyarrow.lib.ChunkedArray.filter()

File ~/conda/envs/dev/lib/python3.11/site-packages/pyarrow/compute.py:264, in _make_generic_wrapper.<locals>.wrapper(memory_pool, options, *args, **kwargs)
    262 if args and isinstance(args[0], Expression):
    263     return Expression._call(func_name, list(args), options)
--> 264 return func.call(args, options, memory_pool)

File ~/conda/envs/dev/lib/python3.11/site-packages/pyarrow/_compute.pyx:385, in pyarrow._compute.Function.call()

File ~/conda/envs/dev/lib/python3.11/site-packages/pyarrow/error.pxi:155, in pyarrow.lib.pyarrow_internal_check_status()

File ~/conda/envs/dev/lib/python3.11/site-packages/pyarrow/error.pxi:92, in pyarrow.lib.check_status()

ArrowNotImplementedError: Function 'array_filter' has no kernel matching input types (string_view, bool)

Essentially, the repr tries to convert the data to numpy, and that step fails because pyarrow has no array_filter kernel implemented for string_view.

Some quick thoughts:

  • Ideally to_numpy() would work regardless of whether filter is implemented (although hopefully an upcoming pyarrow release will support this)
  • The empty repr is quite confusing. Probably printing something like pandas.Series <exception occurred while creating the repr> would be more useful?

@WillAyd
Member Author

WillAyd commented Sep 25, 2024

  • Probably printing something like pandas.Series <exception occurred while creating the repr> would be more useful?

Makes sense for the series, but would this affect the repr when contained within a dataframe?
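For reference, the fallback being discussed could look roughly like the sketch below. `FragileValues` is a made-up stand-in class, not pandas code, and the message format is just the one floated above. It only shows the pattern of catching the exception inside `__repr__`; how that would compose inside a DataFrame repr is exactly the open question.

```python
class FragileValues:
    """Hypothetical stand-in for an object whose formatting can raise."""

    def __init__(self, values, fail=False):
        self.values = values
        self.fail = fail

    def _format(self):
        # Simulates the conversion step that raised in the traceback above.
        if self.fail:
            raise NotImplementedError("no kernel matching input types")
        return "\n".join(f"{i}    {v}" for i, v in enumerate(self.values))

    def __repr__(self):
        try:
            return self._format()
        except Exception as exc:
            # Show a placeholder instead of an empty repr.
            return (
                f"<{type(self).__name__}: exception occurred while creating "
                f"the repr ({type(exc).__name__})>"
            )


print(repr(FragileValues(["foo", "bar"])))
print(repr(FragileValues(["foo", None], fail=True)))
```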

@rhshadrach rhshadrach added Arrow pyarrow functionality Strings String extension data type and string data labels Sep 30, 2024
@rhshadrach
Member

@WillAyd - should the title be Arrow String View? Want to make sure I'm understanding the issue.

@rhshadrach rhshadrach added Needs Discussion Requires discussion from core team before further action and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Sep 30, 2024
@WillAyd
Member Author

WillAyd commented Sep 30, 2024

I don't think so - binary view is the terminology used by the arrow specification, which generally covers what you may be thinking of as bytes and strings:

https://arrow.apache.org/docs/format/Columnar.html#variable-size-binary-view-layout

The same issue occurs with the binary_view pyarrow type as well
