String dtype: more informative repr (keeping brief __str__) #61148

Draft pull request: 5 commits, base: main

@jorisvandenbossche (Member) commented on Mar 19, 2025

Attempt to address #59342

With the current version of the PR, the reprs for the different dtype variants are:

# default NaN variants
<StringDtype(na_value=nan)>
<StringDtype(storage='python', na_value=nan)>
# nullable NA variants
<StringDtype(na_value=<NA>)>
<StringDtype(storage='python', na_value=<NA>)>
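
As a rough illustration of the split in the PR title (an informative `__repr__` next to a brief `__str__`), here is a minimal standalone sketch. `ToyStringDtype` and the `NA` sentinel are hypothetical stand-ins and not the pandas implementation. The sketch assumes that the storage is listed only when it was passed explicitly, and that the NaN variant is named "str" while the NA variant is named "string".

```python
import math


class _NAType:
    """Hypothetical stand-in for pd.NA; only its repr matters here."""

    def __repr__(self) -> str:
        return "<NA>"


NA = _NAType()


class ToyStringDtype:
    """Toy dtype (not the pandas class) with a detailed repr and a brief str."""

    def __init__(self, storage=None, na_value=math.nan):
        self.storage = storage
        self.na_value = na_value

    @property
    def name(self) -> str:
        # Assumption: the NaN variant is named "str", the NA variant "string".
        return "string" if self.na_value is NA else "str"

    def __str__(self) -> str:
        # Brief form: just the name, without storage or na_value.
        return self.name

    def __repr__(self) -> str:
        # Detailed form: list the storage only when it was given explicitly,
        # and reuse the missing value's own repr (nan or <NA>).
        parts = []
        if self.storage is not None:
            parts.append(f"storage={self.storage!r}")
        parts.append(f"na_value={self.na_value!r}")
        return f"<StringDtype({', '.join(parts)})>"


variants = [
    ToyStringDtype(),
    ToyStringDtype(storage="python"),
    ToyStringDtype(na_value=NA),
    ToyStringDtype(storage="python", na_value=NA),
]
for dtype in variants:
    print(f"{dtype!r}  ->  str: {dtype}")
```

Keeping `__str__` tied to `name` means that code comparing `str(dtype)` against a short alias keeps working, while the repr is free to carry the extra parameters.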

Some questions to decide on:

  • Do we use surrounding <...> or not? (We are somewhat inconsistent internally for similar reprs: e.g. the Index repr does not use them, while the ExtensionArray repr does.)
    • Including the <...> makes it clearer that the repr is not necessarily executable code, I think.
  • Do we keep the current __str__ as is (i.e. just "str" or "string"), or do we include the storage for the "string" case to preserve the current repr behaviour? That would give the options "str", "string[pyarrow]" or "string[python]".
    • Essentially, this comes down to whether or not we change the dtype.name attribute (which right now is defined to be "str" or "string").
    • For comparison, for DatetimeTZDtype we do include the [] parametrization in .name (e.g. "datetime64[s, UTC]"), while for CategoricalDtype we do not (there it is just "category").
    • If we don't use it in the name/str, we never actually show "string[python]", even though we still allow it as a string alias for dtype arguments (e.g. in constructors or in astype()).
  • Currently I just use the own reprs of pd.NA and np.nan, which means they are displayed as <NA> and nan.
    • We could also display them as pd.NA and np.nan. That would make the repr more "executable", which could be nice, but on the other hand I don't want to encourage that too much.
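
To make the "executable repr" trade-off in the last question concrete, the following small sketch checks which spellings of a missing value can be evaluated as Python. It uses only the standard library: `math.nan` stands in for `np.nan` as an assumption, and `can_eval` is a hypothetical helper written for this illustration.

```python
import math


def can_eval(expr, namespace):
    """Return True if expr evaluates without error in the given namespace."""
    try:
        eval(expr, dict(namespace))
    except (NameError, SyntaxError):
        return False
    return True


# The values' own reprs, as used in the current version of the PR.
print(repr(math.nan), can_eval(repr(math.nan), {}))  # bare name: not defined
print("<NA>", can_eval("<NA>", {}))                  # not valid Python syntax

# A qualified spelling evaluates once the module is in scope.
print("math.nan", can_eval("math.nan", {"math": math}))
```

Neither of the own reprs round-trips through `eval`, which fits with wrapping the whole repr in <...> to signal that it is not meant to be executed.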

@jorisvandenbossche added the Output-Formatting (__repr__ of pandas objects, to_string) and Strings (String extension data type and string data) labels on Mar 19, 2025