ENH: Need API support and __repr__ to discover the storage used for strings #59342

Open
@arnaudlegout

Originally raised in #58551 (comment)

Problem Description

With PDEP-14, developers need to be aware of the storage used for strings, because the storage can have a large impact on performance. For instance:

  • pyarrow storage
    • pros: compact (optimal memory footprint), fast (vectorization)
    • cons: immutable (so any modification creates a new pyarrow string ChunkedArray)
  • python storage
    • pros: mutable
    • cons: highest memory footprint (each string is a different Python object), slow (no vectorization)
  • numpy 2.0 strings storage (I don't have a good knowledge of these new strings, and never tested them)
    • pros: compact, vectorization, mutable (my understanding is that it takes more space and is slower than pyarrow strings)
    • cons: different representations depending on string size, which makes understanding performance harder

Feature Description

I would like to have two ways to discover the storage:

  • __repr__: its goal is to give information on the internals of an object. One option, suggested by @jorisvandenbossche, is to display <pandas.StringDtype(storage=...)> instead of string[storage].
  • .get_storage: returns the storage (I am not sure what is possible with the current implementation; a class would be best, otherwise a string). This API is useful for checking that we have the correct storage before running time-consuming code.
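
To illustrate the second point, here is a minimal sketch of what such a check could look like from user code today. It assumes only that the dtype object exposes a `storage` attribute, as `pandas.StringDtype` does. The names `get_storage` and `require_storage` are hypothetical helpers written for this example, not existing pandas API, and returning a plain string rather than a class is just one of the two options mentioned above.

```python
from typing import Any, Optional


def get_storage(obj: Any) -> Optional[str]:
    """Hypothetical helper: return the string storage behind `obj`, or None."""
    # Accept either a container with a .dtype (Series, Index, array)
    # or a dtype object passed directly.
    dtype = getattr(obj, "dtype", obj)
    # Assumption: string dtypes expose `storage` (pandas.StringDtype does);
    # dtypes without that attribute are reported as None.
    storage = getattr(dtype, "storage", None)
    return storage if isinstance(storage, str) else None


def require_storage(obj: Any, expected: str) -> None:
    """Hypothetical guard: fail fast before time-consuming code runs."""
    actual = get_storage(obj)
    if actual != expected:
        raise TypeError(
            f"expected {expected!r} string storage, got {actual!r}"
        )
```

With pandas imported as `pd`, `get_storage(pd.Series(["a"], dtype=pd.StringDtype("python")))` should return `"python"`, and `require_storage(ser, "pyarrow")` could be placed at the top of an expensive pipeline. The helper is duck-typed on purpose, so it returns None for object-dtype data instead of raising.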

Alternative Solutions

.

Additional Context

No response

Labels

  • Enhancement
  • Needs Discussion: Requires discussion from core team before further action
  • Strings: String extension data type and string data