Fix schema enforcement in streaming _convert_to_arrow #8042

Open
HaukurPall wants to merge 1 commit into huggingface:main from HaukurPall:fix-convert-to-arrow-schema-enforcement

@HaukurPall
Contributor

When the iter_arrow chain is broken (e.g. after .map() without with_format("arrow")), _convert_to_arrow falls back to pa.Table.from_pylist(), which infers column types from the values in each batch. If early batches contain only None or empty lists, Arrow infers null/list<null> types that conflict with the types inferred for later batches containing real data, causing ArrowInvalid schema mismatch errors.

Add an optional features parameter to _convert_to_arrow that applies cast_table_to_features to each produced table, mirroring what ArrowWriter.write_table does in the map-style path.

@HaukurPall force-pushed the fix-convert-to-arrow-schema-enforcement branch from 291b12a to beef56c on March 11, 2026 12:30
