API: categories.dtype with pd.read_csv(..., dtype='category') #56044

```python
import pandas as pd
from io import StringIO

data = """a,b,c
1,a,3.4
1,a,3.4
2,b,4.5"""

# Default (C) engine
res1 = pd.read_csv(StringIO(data), dtype="category")
# pyarrow engine (requires pyarrow to be installed)
res2 = pd.read_csv(StringIO(data), dtype="category", engine="pyarrow")

print(repr(res1["a"].dtype))
# CategoricalDtype(categories=['1', '2'], ordered=False, categories_dtype=object)

print(repr(res2["a"].dtype))
# CategoricalDtype(categories=[1, 2], ordered=False, categories_dtype=int64)
```

This example is based on `tests.io.parser.dtypes.test_categorical.test_categorical_dtype`.

We should decide which of these two behaviors we want and make it consistent across engines.

The pyarrow engine does all of its parsing and then calls `.astype("category")` in `_finalize_pandas_output`. The other engines appear to go through `Categorical._from_inferred_categories`. The python parser explicitly casts back to strings inside `_cast_types`, with a comment saying it does this only for consistency with the C parser.
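The difference can be approximated with public API only. This is a rough sketch, not the actual internal code paths: it assumes that "parse first, then `.astype("category")`" stands in for what the pyarrow engine does, and that "read as strings, then categorize" stands in for the other engines. The variable names are made up for illustration.

```python
import pandas as pd
from io import StringIO

data = """a,b,c
1,a,3.4
1,a,3.4
2,b,4.5"""

# Parse with inferred dtypes first, then cast to category
# (roughly what the pyarrow engine is described as doing).
parsed_then_cast = pd.read_csv(StringIO(data))["a"].astype("category")

# Read everything as strings, then cast to category
# (roughly the outcome of the C and python parsers).
strings_then_cast = pd.read_csv(StringIO(data), dtype=str)["a"].astype("category")

print(parsed_then_cast.cat.categories.dtype)   # integer categories
print(list(strings_then_cast.cat.categories))  # string categories
```

Both results hold the same two categories, but only the first keeps them as integers.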

I think integer categories are the more reasonable behavior. Mainly, I dislike that the pyarrow engine silently behaves differently (see the xfailed tests in `tests.io.parser.dtypes.test_categorical`).
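For anyone who wants to check consistency locally, something like the following might help. `categories_by_engine` is a hypothetical helper, not part of pandas or its test suite. It defaults to the C and python engines so that it runs without pyarrow.

```python
import pandas as pd
from io import StringIO

DATA = """a,b,c
1,a,3.4
1,a,3.4
2,b,4.5"""


def categories_by_engine(data, column, engines=("c", "python")):
    """Return {engine: list of categories} for `column` read with dtype='category'.

    Hypothetical helper for illustration only.
    """
    out = {}
    for engine in engines:
        df = pd.read_csv(StringIO(data), dtype="category", engine=engine)
        out[engine] = list(df[column].cat.categories)
    return out


result = categories_by_engine(DATA, "a")
print(result)
```

Passing `engines=("c", "python", "pyarrow")` with pyarrow installed should surface the mismatch described above.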

