Originally raised in #58551 (comment)
Problem Description
With PDEP-14, developers need to be aware of the storage used for strings, because the storage can have a large impact on performance. For instance:
- `pyarrow` storage
  - pros: compact (optimal memory footprint), fast (vectorization)
  - cons: immutable (so any modification creates a new pyarrow string `ChunkedArray`)
- `python` storage
  - pros: mutable
  - cons: highest memory footprint (each string is a different Python object), slow (no vectorization)
- `numpy` 2.0 strings storage (I don't have good knowledge of these new strings and have never tested them)
  - pros: compact, vectorization, mutable (my understanding is that it takes more space and is slower than pyarrow strings)
  - cons: different representations depending on the string size, which makes performance harder to understand
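To make the memory argument above concrete, here is a rough, standard-library-only sketch. It does not use pandas or pyarrow: it only mimics the two layouts (one Python object per string versus one contiguous UTF-8 buffer plus an offsets array, which is roughly how Arrow lays out strings). The names `words`, `object_bytes`, `compact_bytes` and `get_value` are made up for this illustration, and the exact byte counts depend on the Python build.

```python
import sys
from array import array

words = [f"item-{i}" for i in range(1000)]

# "python"-like layout: one Python str object per value, plus the list of pointers.
object_bytes = sum(sys.getsizeof(w) for w in words) + sys.getsizeof(words)

# Arrow-like layout (simplified): one contiguous UTF-8 buffer plus integer offsets.
encoded = [w.encode("utf-8") for w in words]
data = b"".join(encoded)
offsets = array("i", [0])
for chunk in encoded:
    offsets.append(offsets[-1] + len(chunk))
compact_bytes = len(data) + len(offsets) * offsets.itemsize


def get_value(i):
    """Read the i-th string back out of the contiguous buffer."""
    return data[offsets[i]:offsets[i + 1]].decode("utf-8")


print(f"object layout:  {object_bytes} bytes")
print(f"compact layout: {compact_bytes} bytes")
```

The contiguous layout is also why in-place modification is awkward: changing the length of one value shifts every offset after it, so a new buffer is usually built instead.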
Feature Description
I would like to have two ways to discover the storage:
- `__repr__`: its goal is to give information on the internals of an object. One option suggested by @jorisvandenbossche is to display `<pandas.StringDtype(storage=...)>` instead of `string[storage]`.
- `.get_storage`, which returns the storage (not sure what is possible with the current implementation; a class would be best, otherwise a string). This API is useful to check that we have the correct storage before running time-consuming code.
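As a rough sketch of the kind of check I have in mind: `get_storage` and `require_storage` below are hypothetical helpers, not existing pandas API. They rely only on the `storage` attribute that `pandas.StringDtype` already exposes, and return `None` for anything that has no such attribute.

```python
def get_storage(obj):
    """Hypothetical helper: return the string storage of a dtype or Series, else None."""
    # Accept either a Series/array (which has .dtype) or a dtype itself.
    dtype = getattr(obj, "dtype", obj)
    return getattr(dtype, "storage", None)


def require_storage(obj, expected):
    """Fail fast before a long computation if the storage is not the expected one."""
    actual = get_storage(obj)
    if actual != expected:
        raise TypeError(f"expected string storage {expected!r}, got {actual!r}")


def demo():
    # Imported lazily so the helpers above do not depend on pandas themselves.
    import pandas as pd

    ser = pd.Series(["a", "b"], dtype=pd.StringDtype(storage="python"))
    print(get_storage(ser))
    require_storage(ser, "python")  # passes silently; "pyarrow" would raise
```

Calling `demo()` requires pandas to be installed. Returning a plain string keeps the sketch simple; a dedicated class, as suggested above, would need support inside pandas itself.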
Alternative Solutions
.
Additional Context
No response