ENH: Consistent naming conventions for string dtype aliases #58141

Open
@WillAyd

Description

Feature Type

  • Adding new functionality to pandas

  • Changing existing functionality in pandas

  • Removing existing functionality in pandas

Problem Description

Right now the string aliases for our types are inconsistent:

>>> import pandas as pd
>>> pd.Series(range(3), dtype="int8")  # NumPy type
>>> pd.Series(range(3), dtype="Int8")  # Pandas extension type
>>> pd.Series(range(3), dtype="int8[pyarrow]") # Arrow type

Strings have a similar inconsistency with "string", "string[pyarrow]" and "string[pyarrow_numpy]"
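To make the inconsistency concrete, here is a minimal sketch that resolves each alias to its dtype object without building a Series. It assumes a reasonably recent pandas; the pyarrow-backed aliases are only included when pyarrow is importable, and "string[pyarrow_numpy]" is left out because its availability varies across pandas versions.

```python
import importlib.util

import pandas as pd

# Aliases mentioned above that need only pandas itself.
aliases = ["int8", "Int8", "string"]

# The Arrow-backed aliases require pyarrow, so add them only when it is installed.
if importlib.util.find_spec("pyarrow") is not None:
    aliases += ["int8[pyarrow]", "string[pyarrow]"]

# pandas_dtype turns a string alias into the dtype object it stands for.
resolved = {alias: pd.api.types.pandas_dtype(alias) for alias in aliases}

for alias, dtype in resolved.items():
    # Print the alias next to the class of the dtype it resolves to.
    print(f"{alias!r:20} -> {type(dtype).__name__}")
```

The casing of the first letter alone selects between a NumPy dtype and a pandas extension dtype, while the Arrow variant uses the bracket suffix.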

Feature Description

I think we should create "int8[numpy]" and "int8[pandas]" aliases to stay consistent with pyarrow. This also has the advantage of decoupling "int8" from NumPy, so perhaps in the future we could let a backend setting determine whether NumPy or pyarrow types are returned.

The pattern thus becomes "data_type[backend]", with the exception of "string[pyarrow_numpy]", which combines the backend and nullability semantics. I am less sure what to do in that case. Maybe even that should be called "string[pyarrow, numpy]", where the second argument is the nullability?

In any case, I am just hoping we can start to detach the logical type from the physical storage / nullability semantics with a well-defined pattern.
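As a rough illustration of the "data_type[backend]" pattern, the sketch below splits an alias into a logical type, a backend, and an optional nullability argument. This is not pandas API: `ParsedAlias` and `parse_alias` are hypothetical names, and the two-argument form follows the tentative "string[pyarrow, numpy]" idea rather than anything pandas accepts today.

```python
import re
from typing import NamedTuple, Optional


class ParsedAlias(NamedTuple):
    """Hypothetical container for the pieces of a dtype alias."""

    logical_type: str
    backend: Optional[str]
    nullability: Optional[str]


# A name, optionally followed by one bracketed, comma-separated argument list.
_ALIAS_RE = re.compile(r"^(?P<logical>\w+)(?:\[(?P<args>[^\[\]]*)\])?$")


def parse_alias(alias: str) -> ParsedAlias:
    """Split 'type', 'type[backend]' or 'type[backend, nullability]'."""
    match = _ALIAS_RE.match(alias.strip())
    if match is None:
        raise ValueError(f"unrecognized dtype alias: {alias!r}")

    logical = match.group("logical")
    args = match.group("args")
    if args is None:
        # Bare alias such as "int8": backend is left for a default to decide.
        return ParsedAlias(logical, None, None)

    parts = [part.strip() for part in args.split(",")]
    if not 1 <= len(parts) <= 2 or not all(parts):
        raise ValueError(f"expected 'type[backend]' or 'type[backend, nullability]': {alias!r}")

    backend = parts[0]
    nullability = parts[1] if len(parts) == 2 else None
    return ParsedAlias(logical, backend, nullability)


print(parse_alias("int8[pyarrow]"))
print(parse_alias("string[pyarrow, numpy]"))
```

Keeping the backend as `None` for a bare alias is one way to leave room for a global backend setting to decide between NumPy and pyarrow later.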

@phofl

Alternative Solutions

n/a

Additional Context

No response
