
TypeError when setting --dataset.stream=true (finetune pi05) #2366

@sigongzi

Description


System Info

Copy-and-paste the text below in your GitHub issue and FILL OUT the last point.

- lerobot version: 0.4.1
- Platform: Linux-5.4.0-216-generic-x86_64-with-glibc2.31
- Python version: 3.10.19
- Huggingface Hub version: 0.35.3
- Datasets version: 4.1.1
- Numpy version: 2.2.6
- PyTorch version: 2.7.1+cu126
- Is PyTorch built with CUDA support?: True
- Cuda version: 12.6
- GPU model: NVIDIA GeForce RTX 4090
- Using GPU in script?: <fill in>

Information

  • One of the scripts in the examples/ folder of LeRobot
  • My own task or dataset (give details below)

Reproduction

Running the pi05 finetuning command with the additional setting --dataset.stream=true.

(I downloaded the data locally; the data is from HuggingFaceVLA/libero.)

python src/lerobot/scripts/lerobot_train.py \
    --dataset.root=/data1/sijiaqi/lerobot/data/libero \
    --dataset.repo_id=my/data \
    --dataset.stream=true \
    --policy.type=pi05 \
    --output_dir=./outputs/pi05_training \
    --job_name=pi05_training \
    --policy.repo_id=your_repo_id \
    --policy.pretrained_path=model/pi05_base \
    --policy.compile_model=true \
    --policy.gradient_checkpointing=true \
    --wandb.enable=true \
    --policy.dtype=bfloat16 \
    --steps=3000 \
    --policy.device=cuda \
    --batch_size=32
The error message is:
INFO 2025-11-03 10:38:53 ot_train.py:247 Creating optimizer and scheduler
INFO 2025-11-03 10:38:53 hedulers.py:105 Auto-scaling LR scheduler: num_training_steps (3000) < num_decay_steps (30000). Scaling warmup: 1000 → 100, decay: 30000 → 3000 (scale factor: 0.100)
INFO 2025-11-03 10:38:53 ot_train.py:259 Output dir: outputs/pi05_training
INFO 2025-11-03 10:38:53 ot_train.py:262 cfg.steps=3000 (3K)
INFO 2025-11-03 10:38:53 ot_train.py:263 dataset.num_frames=273465 (273K)
INFO 2025-11-03 10:38:53 ot_train.py:264 dataset.num_episodes=1693
INFO 2025-11-03 10:38:53 ot_train.py:267 Effective batch size: 32 x 1 = 32
INFO 2025-11-03 10:38:53 ot_train.py:268 num_learnable_params=3616757520 (4B)
INFO 2025-11-03 10:38:53 ot_train.py:269 num_total_params=3616757520 (4B)
INFO 2025-11-03 10:38:53 ot_train.py:324 Start offline training on a fixed dataset
Traceback (most recent call last):
  File "/data1/sijiaqi/lerobot/src/lerobot/scripts/lerobot_train.py", line 448, in <module>
    main()
  File "/data1/sijiaqi/lerobot/src/lerobot/scripts/lerobot_train.py", line 444, in main
    train()
  File "/data1/sijiaqi/lerobot/src/lerobot/configs/parser.py", line 233, in wrapper_inner
    response = fn(cfg, *args, **kwargs)
  File "/data1/sijiaqi/lerobot/src/lerobot/scripts/lerobot_train.py", line 328, in train
    batch = next(dl_iter)
  File "/data1/sijiaqi/lerobot/src/lerobot/datasets/utils.py", line 898, in cycle
    yield next(iterator)
  File "/data1/sijiaqi/conda_env/lerobot/lib/python3.10/site-packages/accelerate/data_loader.py", line 866, in __iter__
    next_batch, next_batch_info = self._fetch_batches(main_iterator)
  File "/data1/sijiaqi/conda_env/lerobot/lib/python3.10/site-packages/accelerate/data_loader.py", line 820, in _fetch_batches
    batches.append(next(iterator))
  File "/data1/sijiaqi/conda_env/lerobot/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 733, in __next__
    data = self._next_data()
  File "/data1/sijiaqi/conda_env/lerobot/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1515, in _next_data
    return self._process_data(data, worker_id)
  File "/data1/sijiaqi/conda_env/lerobot/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1550, in _process_data
    data.reraise()
  File "/data1/sijiaqi/conda_env/lerobot/lib/python3.10/site-packages/torch/_utils.py", line 750, in reraise
    raise exception
TypeError: Caught TypeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/data1/sijiaqi/conda_env/lerobot/lib/python3.10/site-packages/torch/utils/data/_utils/collate.py", line 171, in collate
    {
  File "/data1/sijiaqi/conda_env/lerobot/lib/python3.10/site-packages/torch/utils/data/_utils/collate.py", line 172, in <dictcomp>
    key: collate(
  File "/data1/sijiaqi/conda_env/lerobot/lib/python3.10/site-packages/torch/utils/data/_utils/collate.py", line 240, in collate
    raise TypeError(default_collate_err_msg_format.format(elem_type))
TypeError: default_collate: batch must contain tensors, numpy arrays, numbers, dicts or lists; found <class 'PIL.PngImagePlugin.PngImageFile'>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/data1/sijiaqi/conda_env/lerobot/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 349, in _worker_loop
    data = fetcher.fetch(index)  # type: ignore[possibly-undefined]
  File "/data1/sijiaqi/conda_env/lerobot/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 43, in fetch
    return self.collate_fn(data)
  File "/data1/sijiaqi/conda_env/lerobot/lib/python3.10/site-packages/torch/utils/data/_utils/collate.py", line 398, in default_collate
    return collate(batch, collate_fn_map=default_collate_fn_map)
  File "/data1/sijiaqi/conda_env/lerobot/lib/python3.10/site-packages/torch/utils/data/_utils/collate.py", line 191, in collate
    return {
  File "/data1/sijiaqi/conda_env/lerobot/lib/python3.10/site-packages/torch/utils/data/_utils/collate.py", line 192, in <dictcomp>
    key: collate([d[key] for d in batch], collate_fn_map=collate_fn_map)
  File "/data1/sijiaqi/conda_env/lerobot/lib/python3.10/site-packages/torch/utils/data/_utils/collate.py", line 240, in collate
    raise TypeError(default_collate_err_msg_format.format(elem_type))
TypeError: default_collate: batch must contain tensors, numpy arrays, numbers, dicts or lists; found <class 'PIL.PngImagePlugin.PngImageFile'>

If --dataset.stream=true is removed, the program runs normally, so I believe this is a bug.
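
The underlying failure can be reproduced independently of LeRobot: torch's default_collate has no handler for PIL images, so any batch containing one raises the same TypeError. A minimal sketch (dummy data, not tied to the LeRobot pipeline):

from PIL import Image
from torch.utils.data import default_collate

# Two dummy samples shaped like a dataset item with a PIL image field.
batch = [{"image": Image.new("RGB", (8, 8))} for _ in range(2)]

# Raises: TypeError: default_collate: batch must contain tensors, numpy arrays,
# numbers, dicts or lists; found <class 'PIL.Image.Image'>
default_collate(batch)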

My AI assistant's analysis of the cause is as follows:

The non-streaming LeRobotDataset calls set_transform(hf_transform_to_torch) on the HF dataset, which explicitly converts PIL.Image.Image objects into torch.Tensors (with the format: C, H, W, float32, [0,1]).

The streaming StreamingLeRobotDataset uses load_dataset(..., streaming=True) which returns an IterableDataset. This streaming dataset does not have a transform set; it only calls item_to_torch(item). Currently, this function only converts np.ndarray and list into Tensors, and does not handle PIL.Image.Image objects.
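
A possible direction for a fix (just a sketch based on the analysis above; the actual signature and contents of item_to_torch in lerobot may differ) would be to handle PIL.Image.Image in the streaming conversion the same way the non-streaming transform does, i.e. HWC uint8 -> CHW float32 in [0, 1]:

import numpy as np
import torch
from PIL import Image


def item_to_torch(item: dict) -> dict:
    # Sketch only: convert each field of a streaming item to a torch.Tensor,
    # including PIL images, mirroring the non-streaming transform's output format.
    converted = {}
    for key, value in item.items():
        if isinstance(value, Image.Image):
            # Assumes an RGB image: HWC uint8 -> CHW float32 in [0, 1]
            array = np.array(value)
            converted[key] = torch.from_numpy(array).permute(2, 0, 1).float() / 255.0
        elif isinstance(value, (np.ndarray, list)):
            # Existing behavior as described above
            converted[key] = torch.as_tensor(value)
        else:
            converted[key] = value
    return converted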

Expected behavior

The program should run normally, in the same way as when streaming is not enabled.

(Although the process was eventually terminated by a CUDA OutOfMemoryError, the fact that training reached the backward pass indicates that the data was loaded and collated normally.)

Without the stream setting:
INFO 2025-11-03 09:38:37 ot_train.py:268 num_learnable_params=3616757520 (4B)
INFO 2025-11-03 09:38:37 ot_train.py:269 num_total_params=3616757520 (4B)
INFO 2025-11-03 09:38:37 ot_train.py:324 Start offline training on a fixed dataset
Traceback (most recent call last):
  File "/data1/sijiaqi/lerobot/src/lerobot/scripts/lerobot_train.py", line 448, in <module>
    main()
  File "/data1/sijiaqi/lerobot/src/lerobot/scripts/lerobot_train.py", line 444, in main
    train()
  File "/data1/sijiaqi/lerobot/src/lerobot/configs/parser.py", line 233, in wrapper_inner
    response = fn(cfg, *args, **kwargs)
  File "/data1/sijiaqi/lerobot/src/lerobot/scripts/lerobot_train.py", line 332, in train
    train_tracker, output_dict = update_policy(
  File "/data1/sijiaqi/lerobot/src/lerobot/scripts/lerobot_train.py", line 95, in update_policy
    accelerator.backward(loss)
  File "/data1/sijiaqi/conda_env/lerobot/lib/python3.10/site-packages/accelerate/accelerator.py", line 2740, in backward
    loss.backward(**kwargs)
  File "/data1/sijiaqi/conda_env/lerobot/lib/python3.10/site-packages/torch/_tensor.py", line 648, in backward
    torch.autograd.backward(
  File "/data1/sijiaqi/conda_env/lerobot/lib/python3.10/site-packages/torch/autograd/__init__.py", line 353, in backward
    _engine_run_backward(
  File "/data1/sijiaqi/conda_env/lerobot/lib/python3.10/site-packages/torch/autograd/graph.py", line 824, in _engine_run_backward
    return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 712.00 MiB. GPU 0 has a total capacity of 23.65 GiB of which 633.69 MiB is free. Process 3123961 has 7.92 GiB memory in use. Including non-PyTorch memory, this process has 15.09 GiB memory in use. Of the allocated memory 14.01 GiB is allocated by PyTorch, and 632.42 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
