Describe the bug
Description of the bug
Dataset.map crashes because the internal writer is still None when the map function returns None for the first few examples and a dictionary (or pa.Table / DataFrame) for later examples. The writer is initialized only when i == 0 (or i[0] == 0 in batched mode), but update_data is determined lazily after the first example/batch is processed, so a later non-None return reaches a writer that was never created.
Steps to reproduce
from datasets import Dataset
ds = Dataset.from_dict({"x": [1, 2, 3]})
def fn(example, idx):
    if idx < 2:
        return None
    return {"x": [example["x"] * 10]}
list(ds.map(fn, with_indices=True))
Expected behavior
- The function should work regardless of when update_data becomes True.
- The writer should be initialized the first time a non-None return occurs, not tied to the first index.
Environment info
datasets version:
- Python version: 3.12
- OS:
Suggested fix
Replace if i == 0 / if i[0] == 0 checks with if writer is None when initializing the writer.
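The difference between the two checks can be illustrated with a toy loop that does not depend on the datasets code base. This is only a sketch: ToyWriter, run_map, and the lazy_init flag are hypothetical names invented for the illustration, and the real logic in the library is more involved.

```python
from typing import Callable, Optional


class ToyWriter:
    """Hypothetical stand-in for the real writer; it only collects rows."""

    def __init__(self) -> None:
        self.rows: list = []

    def write(self, row: dict) -> None:
        self.rows.append(row)


def run_map(
    fn: Callable[[dict, int], Optional[dict]],
    examples: list,
    lazy_init: bool,
) -> list:
    """Simplified map loop: a row is written only when fn returns a dict."""
    writer: Optional[ToyWriter] = None
    for i, example in enumerate(examples):
        result = fn(example, i)
        if result is None:
            continue  # nothing to write for this example
        # lazy_init=False mimics the index-based check (i == 0).
        # lazy_init=True mimics the suggested check (writer is None).
        should_init = (writer is None) if lazy_init else (i == 0)
        if should_init:
            writer = ToyWriter()
        writer.write(result)  # raises AttributeError if writer is still None
    return writer.rows if writer is not None else []


def fn(example: dict, idx: int) -> Optional[dict]:
    # None for the first two examples, a dict afterwards
    return None if idx < 2 else {"x": example["x"] * 10}


examples = [{"x": 1}, {"x": 2}, {"x": 3}]

print(run_map(fn, examples, lazy_init=True))  # [{'x': 30}]

try:
    run_map(fn, examples, lazy_init=False)
except AttributeError as err:
    print("index-based check failed:", err)
```

With the index-based check, the writer is never created because the only non-None result arrives at index 2; with the writer is None check, it is created on first use.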
Steps to reproduce the bug
from datasets import Dataset
# Create a minimal dataset
ds = Dataset.from_dict({"x": [1, 2, 3]})
# Define a map function that returns None for first examples, dict later
def fn(example, idx):
    if idx < 2:
        return None
    return {"x": [example["x"] * 10]}
# Apply map with indices
list(ds.map(fn, with_indices=True))
Expected: function executes without errors.
Observed: crashes with AttributeError: 'NoneType' object has no attribute 'write', because the internal writer has not been initialized when the first non-None return happens at an index i > 0.
This reproduction is minimal and demonstrates the exact failure condition (None early, dict later).
Expected behavior
The Dataset.map function should handle map functions that return None for some examples and a dictionary (or pa.Table / DataFrame) for later examples. In this case, the internal writer should be initialized when the first non-None value is returned, so that the dataset can be updated without crashing. The code should run successfully for all examples and return the updated dataset.
Environment info
- python3.12
- datasets==3.6.0 [but the latest version still has this problem]
- transformers==4.55.2
I found this bug in datasets/arrow_dataset.py.