Traceback (most recent call last):
File "~/datasets-bug/explore.py", line 8, in <module>
dataset = dataset.cast_column("audio", Audio(sampling_rate=24000))
File "~/datasets-bug/.venv/lib/python3.14/site-packages/datasets/dataset_dict.py", line 337, in cast_column
return DatasetDict({k: dataset.cast_column(column=column, feature=feature) for k, dataset in self.items()})
~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "~/datasets-bug/.venv/lib/python3.14/site-packages/datasets/fingerprint.py", line 468, in wrapper
out = func(dataset, *args, **kwargs)
File "~/datasets-bug/.venv/lib/python3.14/site-packages/datasets/arrow_dataset.py", line 2201, in cast_column
dataset._data = dataset._data.cast(dataset.features.arrow_schema)
~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "~/datasets-bug/.venv/lib/python3.14/site-packages/datasets/table.py", line 1124, in cast
return MemoryMappedTable(table_cast(self.table, *args, **kwargs), self.path, replays)
~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "~/datasets-bug/.venv/lib/python3.14/site-packages/datasets/table.py", line 2272, in table_cast
return cast_table_to_schema(table, schema)
File "~/datasets-bug/.venv/lib/python3.14/site-packages/datasets/table.py", line 2224, in cast_table_to_schema
cast_array_to_feature(
~~~~~~~~~~~~~~~~~~~~~^
table[name] if name in table_column_names else pa.array([None] * len(table), type=schema.field(name).type),
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
feature,
^^^^^^^^
)
^
File "~/datasets-bug/.venv/lib/python3.14/site-packages/datasets/table.py", line 1795, in wrapper
return pa.chunked_array([func(chunk, *args, **kwargs) for chunk in array.chunks])
~~~~^^^^^^^^^^^^^^^^^^^^^^^^
File "~/datasets-bug/.venv/lib/python3.14/site-packages/datasets/table.py", line 1995, in cast_array_to_feature
return feature.cast_storage(array)
~~~~~~~~~~~~~~~~~~~~^^^^^^^
File "~/datasets-bug/.venv/lib/python3.14/site-packages/datasets/features/audio.py", line 272, in cast_storage
return array_cast(storage, self.pa_type)
File "~/datasets-bug/.venv/lib/python3.14/site-packages/datasets/table.py", line 1797, in wrapper
return func(array, *args, **kwargs)
File "~/datasets-bug/.venv/lib/python3.14/site-packages/datasets/table.py", line 1949, in array_cast
return array.cast(pa_type)
~~~~~~~~~~^^^^^^^^^
File "pyarrow/array.pxi", line 1147, in pyarrow.lib.Array.cast
File "~/datasets-bug/.venv/lib/python3.14/site-packages/pyarrow/compute.py", line 412, in cast
return call_function("cast", [arr], options, memory_pool)
File "pyarrow/_compute.pyx", line 604, in pyarrow._compute.call_function
File "pyarrow/_compute.pyx", line 399, in pyarrow._compute.Function.call
File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
pyarrow.lib.ArrowNotImplementedError: Unsupported cast from large_string to struct using function cast_struct
Describe the bug
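Loading a dataset from a CSV file that has a single "audio" column and a single row containing the path to an audio file fails when the column is cast to Audio, but the exact same dataset created from a dictionary succeeds.
Steps to reproduce the bug
Create an audio file audio.wav and a file audio.csv with the following content:
"audio"
"audio.wav"
Casting the "audio" column to Audio (line 8 of explore.py in the traceback) raises the error shown at the top of this report.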
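The original reproduction script is not preserved in this report, so the following is only a minimal sketch of the steps described above. The helper names `write_fixtures` and `reproduce` are hypothetical, the one-second silent WAV is an assumption (any audio file should do), and the use of `load_dataset("csv", ...)` is inferred from the `DatasetDict` frame in the traceback. Only the `cast_column("audio", Audio(sampling_rate=24000))` call is taken directly from the traceback.

```python
import csv
import wave
from pathlib import Path


def write_fixtures(directory: Path, sampling_rate: int = 24000) -> tuple[Path, Path]:
    """Create audio.wav and audio.csv in `directory` (hypothetical helper)."""
    directory.mkdir(parents=True, exist_ok=True)
    wav_path = directory / "audio.wav"
    csv_path = directory / "audio.csv"

    # Assumption: one second of 16-bit mono silence stands in for the real file.
    with wave.open(str(wav_path), "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(sampling_rate)
        wav.writeframes(b"\x00\x00" * sampling_rate)

    # One quoted "audio" header and one quoted row holding the file path,
    # matching the CSV content given in the report.
    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f, quoting=csv.QUOTE_ALL, lineterminator="\n")
        writer.writerow(["audio"])
        writer.writerow(["audio.wav"])

    return wav_path, csv_path


def reproduce(directory: Path) -> None:
    """Run both variants from the report (requires the `datasets` package)."""
    # Imported lazily so the fixture helper above works without `datasets`.
    from datasets import Audio, Dataset, load_dataset

    _, csv_path = write_fixtures(directory)

    # Variant 1: same data built from a dict. The report says this cast succeeds.
    from_dict = Dataset.from_dict({"audio": ["audio.wav"]})
    from_dict = from_dict.cast_column("audio", Audio(sampling_rate=24000))

    # Variant 2: loaded from the CSV. The report says this cast raises
    # pyarrow.lib.ArrowNotImplementedError (large_string to struct).
    dataset = load_dataset("csv", data_files=str(csv_path))
    dataset = dataset.cast_column("audio", Audio(sampling_rate=24000))
```

Calling `reproduce(Path("."))` in an environment matching the one listed under "Environment info" should hit the second cast and fail as in the traceback, assuming the reconstruction matches the original script.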
Expected behavior
The audio column with file paths loaded from a csv can be converted to AudioDecoder objects the same as an identical dataset created from a dict.
Environment info
datasets 4.3.0 and 4.5.0, Ubuntu 24.04 amd64, python 3.13.11 and 3.14.2