IterableDataset does not use features information in to_pandas #7872

@bonext

Describe the bug

An IterableDataset created from a generator with an explicit features= parameter seems to ignore the provided features for certain operations, e.g. .to_pandas(...), when the data coming from the generator has missing values (here, an empty list in a column of structs).

Steps to reproduce the bug

import datasets
from datasets import features


def test_to_pandas_works_with_explicit_schema():
    common_features = features.Features(
        {
            "a": features.Value("int64"),
            "b": features.List({"c": features.Value("int64")}),
        }
    )

    def row_generator():
        data = [{"a": 1, "b": []}, {"a": 1, "b": [{"c": 1}]}]
        for row in data:
            yield row

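    d = datasets.IterableDataset.from_generator(row_generator, features=common_features)

    # to_pandas() itself raises pyarrow.lib.ArrowInvalid (traceback below),
    # so the loop body is never reached
    for _ in d.to_pandas():
        pass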
        # _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
        # .venv/lib/python3.13/site-packages/datasets/iterable_dataset.py:3703: in to_pandas
        #     table = pa.concat_tables(list(self.with_format("arrow").iter(batch_size=1000)))
        #                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        # .venv/lib/python3.13/site-packages/datasets/iterable_dataset.py:2563: in iter
        #     for key, pa_table in iterator:
        #                          ^^^^^^^^
        # .venv/lib/python3.13/site-packages/datasets/iterable_dataset.py:2078: in _iter_arrow
        #     for key, pa_table in self.ex_iterable._iter_arrow():
        #                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        # .venv/lib/python3.13/site-packages/datasets/iterable_dataset.py:599: in _iter_arrow
        #     yield new_key, pa.Table.from_batches(chunks_buffer)
        #                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        # pyarrow/table.pxi:5039: in pyarrow.lib.Table.from_batches
        #     ???
        # pyarrow/error.pxi:155: in pyarrow.lib.pyarrow_internal_check_status
        #     ???
        # _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

        # >   ???
        # E   pyarrow.lib.ArrowInvalid: Schema at index 1 was different: 
        # E   a: int64
        # E   b: list<item: null>
        # E   vs
        # E   a: int64
        # E   b: list<item: struct<c: int64>>

        # pyarrow/error.pxi:92: ArrowInvalid

Expected behavior

Arrow operations should use the schema provided through features= rather than one inferred from the data.

Environment info

  • datasets version: 4.4.1
  • Platform: macOS-15.7.1-arm64-arm-64bit-Mach-O
  • Python version: 3.13.1
  • huggingface_hub version: 1.1.4
  • PyArrow version: 22.0.0
  • Pandas version: 2.3.3
  • fsspec version: 2025.10.0
