API: categories.dtype with pd.read_csv(..., dtype='category') #56044

Open
@jbrockmendel

Description

import pandas as pd
from io import StringIO

data = """a,b,c
1,a,3.4
1,a,3.4
2,b,4.5"""

res1 = pd.read_csv(StringIO(data), dtype="category")
res2 = pd.read_csv(StringIO(data), dtype="category", engine="pyarrow")

>>> res1['a'].dtype
CategoricalDtype(categories=['1', '2'], ordered=False, categories_dtype=object)

>>> res2['a'].dtype
CategoricalDtype(categories=[1, 2], ordered=False, categories_dtype=int64)

This example is based on tests.io.parser.dtypes.test_categorical.test_categorical_dtype.

We should decide which of these two behaviors we want and make it consistent across engines.

The pyarrow version does all its parsing, then calls .astype("category") in _finalize_pandas_output. The others look like they go through Categorical._from_inferred_categories. The python parser explicitly casts back to strings inside _cast_types, with a comment that it does this only for consistency with the c-parser.
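The two outcomes can be approximated with public API only, without needing pyarrow installed. This is a rough sketch rather than the actual internal code paths: `inferred_then_cast` and `strings_then_cast` are hypothetical names, and the assumption is that "parse with type inference, then call .astype("category")" mirrors the pyarrow path while "keep the raw strings, then categorize" mirrors what the C and python parsers end up producing.

```python
import pandas as pd
from io import StringIO

data = """a,b,c
1,a,3.4
1,a,3.4
2,b,4.5"""

# Approximation of the pyarrow path: let the parser infer dtypes first,
# then cast the finished frame to category.
inferred_then_cast = pd.read_csv(StringIO(data)).astype("category")

# Approximation of the C/python-parser result: values stay strings,
# so the categories are strings as well.
strings_then_cast = pd.read_csv(StringIO(data), dtype=str).astype("category")

print(list(inferred_then_cast["a"].cat.categories))  # numeric categories
print(list(strings_then_cast["a"].cat.categories))   # string categories
```

The only difference between the two frames is when the cast to category happens relative to type inference, which is the inconsistency described above.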

I think integer categories are the more reasonable behavior. Mainly, I dislike that the pyarrow engine silently behaves differently (see the xfailed tests in tests.io.parser.dtypes.test_categorical).
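Until the behavior is unified, users of the default engine who want numeric categories can convert them after the fact. A minimal sketch, assuming the default C engine and the same sample data as above; `numeric_categories` is a name introduced here for illustration only.

```python
import pandas as pd
from io import StringIO

data = """a,b,c
1,a,3.4
1,a,3.4
2,b,4.5"""

res = pd.read_csv(StringIO(data), dtype="category")

# The default engine yields string categories; convert them to numbers
# and swap them in without changing the underlying codes.
numeric_categories = pd.to_numeric(res["a"].cat.categories)
res["a"] = res["a"].cat.rename_categories(numeric_categories)

print(res["a"].dtype)
```

This only changes the category labels, so the column keeps its categorical dtype and its codes.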
