BUG: fix read_csv pyarrow engine column-name and dtype handling by jbrockmendel · Pull Request #65859 · pandas-dev/pandas

jbrockmendel · 2026-06-11T21:20:50Z

Audit of the @xfail_pyarrow tests in pandas/tests/io/parser found ~30 cases caused by a handful of ArrowParserWrapper bugs rather than pyarrow limitations. This fixes them and un-xfails the affected tests:

- Duplicated column names are now mangled to "x.1"-style names, mirroring the dedup algorithm in pandas._libs.parsers.TextReader line-for-line, including the dtype-key propagation from original to mangled names (#35211, #42022) and the named-before-unnamed loop order. Previously the pyarrow engine returned frames with duplicate columns, which also broke index_col ("Found non-unique column index").
- Empty header fields get "Unnamed: {i}" placeholder names instead of "" (#13017).
- An unnamed index_col now produces an unnamed index level instead of an index named "".
- Non-dict dtype together with index_col no longer raises AttributeError: type object 'str' has no attribute 'get'.
- A defaultdict dtype now applies its default to columns not explicitly listed (#41574).
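
To make the naming rules above concrete, here is a minimal pure-Python sketch of placeholder naming, "x.1"-style mangling, dtype-key propagation, and the named-before-unnamed loop order. It is an illustration based on the description in this PR, not the code from ArrowParserWrapper or TextReader. The helper names `fill_unnamed` and `dedup_names` are hypothetical, and the sketch assumes `dtype` is a plain dict or None.

```python
from collections import defaultdict


def fill_unnamed(header):
    """Return (names, unnamed_positions) with "" replaced by "Unnamed: {i}"."""
    names, unnamed = [], []
    for i, name in enumerate(header):
        if name == "":
            name = f"Unnamed: {i}"
            unnamed.append(i)
        names.append(name)
    return names, unnamed


def dedup_names(names, unnamed=(), dtype=None):
    """Mangle repeated names to "x.1", "x.2", ... and propagate dict dtypes."""
    names = list(names)
    dtype = dict(dtype or {})  # work on a copy; plain dict only in this sketch
    counts = defaultdict(int)
    # Visit named columns first so they keep their names; placeholder
    # names are visited last and are the ones that get mangled on a clash.
    order = [i for i in range(len(names)) if i not in unnamed] + list(unnamed)
    for i in order:
        original = col = names[i]
        cur_count = counts[col]
        while cur_count > 0:
            counts[col] = cur_count + 1
            col = f"{col}.{cur_count}"
            cur_count = counts[col]
        # Copy the dtype of the original name to the mangled name
        # unless the caller already gave the mangled name its own dtype.
        if col != original and original in dtype and col not in dtype:
            dtype[col] = dtype[original]
        names[i] = col
        counts[col] = cur_count + 1
    return names, dtype


names, unnamed = fill_unnamed(["a", "a", "", "b", "a"])
names, dtype = dedup_names(names, unnamed, dtype={"a": "float64"})
print(names)  # ['a', 'a.1', 'Unnamed: 2', 'b', 'a.2']
print(dtype)  # {'a': 'float64', 'a.1': 'float64', 'a.2': 'float64'}
```

In this sketch, a header such as `["", "Unnamed: 0"]` keeps the explicitly named column as "Unnamed: 0" and mangles the placeholder to "Unnamed: 0.1", which is the effect the named-before-unnamed order is meant to have. The real parser code may handle further edge cases that this sketch does not cover.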

One parametrization stays xfailed with a narrower conditional mark: test_dtype_all_columns[object], where scalar dtype=object is also applied to the index column (the str variant passes).

No user-facing behavior changes for the other engines; the test diff is marker removals only.

Tests added and passed
All code checks passed

🤖 Generated with Claude Code

Brings the pyarrow engine's header/dtype handling in line with the other engines:

- duplicated column names are now mangled to "x.1"-style names, mirroring the algorithm in pandas._libs.parsers.TextReader (including dtype-key propagation to mangled names, GH#35211)
- empty header fields become "Unnamed: {i}" placeholder names
- an unnamed index_col now produces an unnamed index level instead of an index named ""
- non-dict dtype with index_col no longer raises AttributeError
- defaultdict dtype now applies its default to unlisted columns (GH#41574)

Un-xfails the 30 tests these bugs were responsible for.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>


jbrockmendel added the labels Bug, IO CSV (read_csv, to_csv), and Arrow (pyarrow functionality) on Jun 11, 2026

jbrockmendel and others added 2 commits June 11, 2026 14:21

DOC: whatsnew for pyarrow read_csv fixes

f14ac45

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

TST/TYP: fix CI failures

743443b

- narrow self.names for pyright in defaultdict dtype block
- test_dtype_all_columns: xfail object-index case only under infer_string; xfail check_orig=False object case unconditionally

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

jbrockmendel mentioned this pull request Jun 11, 2026

BUG: raise helpful errors for unsupported pyarrow read_csv usage #65862

Draft

jbrockmendel wants to merge 3 commits into pandas-dev:main from jbrockmendel:tst-pyarrow-csv
