Open
Description
Describe the bug
An IterableDataset created from a generator with an explicit features= parameter seems to ignore the provided features for certain operations, e.g. .to_pandas(...), when the data coming from the generator has missing values (here, an empty list in a column of lists of structs).
Steps to reproduce the bug
import datasets
from datasets import features


def test_to_pandas_works_with_explicit_schema():
    common_features = features.Features(
        {
            "a": features.Value("int64"),
            "b": features.List({"c": features.Value("int64")}),
        }
    )

    def row_generator():
        data = [{"a": 1, "b": []}, {"a": 1, "b": [{"c": 1}]}]
        for row in data:
            yield row

    d = datasets.IterableDataset.from_generator(row_generator, features=common_features)
    for _ in d.to_pandas():
        pass
# _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
# .venv/lib/python3.13/site-packages/datasets/iterable_dataset.py:3703: in to_pandas
# table = pa.concat_tables(list(self.with_format("arrow").iter(batch_size=1000)))
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
# .venv/lib/python3.13/site-packages/datasets/iterable_dataset.py:2563: in iter
# for key, pa_table in iterator:
# ^^^^^^^^
# .venv/lib/python3.13/site-packages/datasets/iterable_dataset.py:2078: in _iter_arrow
# for key, pa_table in self.ex_iterable._iter_arrow():
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
# .venv/lib/python3.13/site-packages/datasets/iterable_dataset.py:599: in _iter_arrow
# yield new_key, pa.Table.from_batches(chunks_buffer)
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
# pyarrow/table.pxi:5039: in pyarrow.lib.Table.from_batches
# ???
# pyarrow/error.pxi:155: in pyarrow.lib.pyarrow_internal_check_status
# ???
# _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
# > ???
# E pyarrow.lib.ArrowInvalid: Schema at index 1 was different:
# E a: int64
# E b: list<item: null>
# E vs
# E a: int64
# E b: list<item: struct<c: int64>>
# pyarrow/error.pxi:92: ArrowInvalid
Expected behavior
Arrow operations should use the schema provided through features= rather than one inferred from the data.
Environment info
- datasets version: 4.4.1
- Platform: macOS-15.7.1-arm64-arm-64bit-Mach-O
- Python version: 3.13.1
- huggingface_hub version: 1.1.4
- PyArrow version: 22.0.0
- Pandas version: 2.3.3
- fsspec version: 2025.10.0
ArjunJagdale