API (string dtype): comparisons between different string classes #60639

Open
@rhshadrach

Some comparisons between different string classes (e.g. string[pyarrow] and str) raise. Fixing this is straightforward, except for deciding which class the result should be. I would expect the result to always follow the left operand, e.g. string[pyarrow] == str should return string[pyarrow], whereas str == string[pyarrow] should return str. Is this the consensus?

We currently run into issues with how Python handles subclasses with comparison dunders.

import numpy as np
import pandas as pd

lhs = pd.array(["x", pd.NA, "y"], dtype="string[pyarrow]")
rhs = pd.array(["x", pd.NA, "y"], dtype=pd.StringDtype("pyarrow", np.nan))

print(lhs.__eq__(rhs))
# <ArrowExtensionArray>
# [True, <NA>, True]
# Length: 3, dtype: bool[pyarrow]

print(lhs == rhs)
# [ True False  True]

The two results above differ because ArrowStringArrayNumpySemantics is a proper subclass of ArrowStringArray and therefore Python first calls rhs.__eq__(lhs).
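The dispatch rule described above can be reproduced without pandas. The following is a minimal sketch using two hypothetical stand-in classes (`Base` and `Child` are illustrative names, not pandas classes); it only shows that Python gives the right operand's `__eq__` priority when the right operand's type is a subclass of the left operand's type.

```python
class Base:
    """Hypothetical stand-in for the parent array class (not a pandas class)."""

    def __eq__(self, other):
        # Report which implementation ran instead of comparing anything.
        return "Base.__eq__"


class Child(Base):
    """Hypothetical stand-in for the subclass that overrides __eq__."""

    def __eq__(self, other):
        return "Child.__eq__"


lhs = Base()
rhs = Child()

# Calling the dunder directly bypasses Python's operator dispatch.
explicit = lhs.__eq__(rhs)

# The == operator sees that type(rhs) is a subclass of type(lhs),
# so it tries rhs.__eq__(lhs) before lhs.__eq__(rhs).
via_operator = lhs == rhs

print(explicit)
print(via_operator)
```

With the operands swapped (`rhs == lhs`), the left operand is already the subclass instance, so `Child.__eq__` runs in both directions.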

We can avoid this by special-casing this comparison in ArrowStringArrayNumpySemantics, but I wanted to open an issue for discussion before proceeding.
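As a rough illustration of what such a special case could look like, here is a pure-Python sketch. It assumes the subclass defers by returning `NotImplemented` when the other operand is exactly the parent class; `ParentArray` and `ChildArray` are hypothetical stand-ins, and this is not necessarily how it would be done in pandas.

```python
class ParentArray:
    """Hypothetical stand-in for the parent class (not a pandas class)."""

    def __eq__(self, other):
        return "parent semantics"


class ChildArray(ParentArray):
    """Hypothetical stand-in for the subclass with the special case."""

    def __eq__(self, other):
        # Assumed special case: defer when the other operand is exactly the
        # parent class, so Python falls back to ParentArray.__eq__.
        if type(other) is ParentArray:
            return NotImplemented
        return "child semantics"


parent = ParentArray()
child = ChildArray()

# Python still tries ChildArray.__eq__ first, gets NotImplemented,
# and then falls back to ParentArray.__eq__(parent, child).
print(parent == child)

# Caveat: __eq__ cannot tell whether it was called as the left or the
# reflected operand, so this direction also ends up in ParentArray.__eq__.
print(child == parent)

print(child == ChildArray())
```

Note the caveat in the second comparison: with this approach the parent's semantics win in both directions, so it does not by itself give "result follows the left operand". Getting that behavior would need some additional mechanism.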

cc @WillAyd @jorisvandenbossche

Labels: API - Consistency, Needs Discussion, Numeric Operations, Strings
