API: categories.dtype with pd.read_csv(..., dtype='category') #56044
Open
import pandas as pd
from io import StringIO
data = """a,b,c
1,a,3.4
1,a,3.4
2,b,4.5"""
res1 = pd.read_csv(StringIO(data), dtype="category")
res2 = pd.read_csv(StringIO(data), dtype="category", engine="pyarrow")
>>> res1['a'].dtype
CategoricalDtype(categories=['1', '2'], ordered=False, categories_dtype=object)
>>> res2['a'].dtype
CategoricalDtype(categories=[1, 2], ordered=False, categories_dtype=int64)
This example is based on tests.io.parser.dtypes.test_categorical.test_categorical_dtype.
We should decide which of these two behaviors we want and make it consistent across engines.
The pyarrow version does all its parsing, then calls .astype("category") in _finalize_pandas_output. The others look like they go through Categorical._from_inferred_categories. The python parser explicitly casts back to strings inside _cast_types, with a comment that it does this only for consistency with the c-parser.
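For comparison, here is a minimal sketch of the "parse first, then convert" order using the default engine, so it runs without pyarrow installed. It only mirrors the order of operations described above and is not the actual code in _finalize_pandas_output; the names res_str and res_num are made up for this illustration.

```python
import pandas as pd
from io import StringIO

data = """a,b,c
1,a,3.4
1,a,3.4
2,b,4.5"""

# Default engine with dtype="category": categories come back as strings.
res_str = pd.read_csv(StringIO(data), dtype="category")

# Let read_csv infer dtypes first, then convert every column to category.
# The categories keep whatever dtype the parser inferred for the column.
res_num = pd.read_csv(StringIO(data)).astype("category")

print(res_str["a"].cat.categories)
print(res_num["a"].cat.categories)
print(res_num["c"].cat.categories)
```

With this order the categories of column a are integers and those of column c are floats, which is the behavior the pyarrow engine shows in the example above.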
I think the integer categories are a more reasonable behavior. Mainly I dislike that the pyarrow engine silently has different behavior (see xfailed tests in tests.io.parser.dtypes.test_categorical).