Loading problems of Datasets with a single shard #6823

@andjoer

Description

Describe the bug

When a dataset is saved to disk with a single shard, loading it with `load_dataset` does not return the original data, whereas a dataset saved in multiple shards loads correctly. I installed the latest version of datasets via pip.

Steps to reproduce the bug

The code below reproduces the behavior. Everything works when the range of the loop is 10000 (eight shards), but it fails when it is 1000 (a single shard).

from PIL import Image
import numpy as np
from datasets import Dataset, DatasetDict, load_dataset


def load_image():
    # Generate random noise image
    noise = np.random.randint(0, 256, (256, 256, 3), dtype=np.uint8)  
    return Image.fromarray(noise)

def create_dataset():
    input_images = []
    output_images = []
    text_prompts = []
  
    for _ in range(10000):  # this is the problematic parameter
        input_images.append(load_image())
        output_images.append(load_image())
        text_prompts.append('test prompt')

    data = {'input_image': input_images, 'output_image': output_images, 'text_prompt': text_prompts}
    dataset = Dataset.from_dict(data)
    
    return DatasetDict({'train': dataset})

dataset = create_dataset()

print('dataset before saving')
print(dataset)
print(dataset['train'].column_names)
dataset.save_to_disk('test_ds')
print('dataset after loading')


dataset_loaded = load_dataset('test_ds')
print(dataset_loaded)
print(dataset_loaded['train'].column_names)

The output for 1000 iterations is:

dataset before saving
DatasetDict({
    train: Dataset({
        features: ['input_image', 'output_image', 'text_prompt'],
        num_rows: 1000
    })
})
['input_image', 'output_image', 'text_prompt']
Saving the dataset (1/1 shards): 100%|█| 1000/1000 [00:00<00:00, 5156.00 example
dataset after loading
Generating train split: 1 examples [00:00, 230.52 examples/s]
DatasetDict({
    train: Dataset({
        features: ['_data_files', '_fingerprint', '_format_columns', '_format_kwargs', '_format_type', '_output_all_columns', '_split'],
        num_rows: 1
    })
})
['_data_files', '_fingerprint', '_format_columns', '_format_kwargs', '_format_type', '_output_all_columns', '_split']

For 10000 iterations (8 shards) it is correct:

dataset before saving
DatasetDict({
    train: Dataset({
        features: ['input_image', 'output_image', 'text_prompt'],
        num_rows: 10000
    })
})
['input_image', 'output_image', 'text_prompt']
Saving the dataset (8/8 shards): 100%|█| 10000/10000 [00:01<00:00, 6237.68 examp
dataset after loading
Generating train split: 10000 examples [00:00, 10773.16 examples/s]
DatasetDict({
    train: Dataset({
        features: ['input_image', 'output_image', 'text_prompt'],
        num_rows: 10000
    })
})
['input_image', 'output_image', 'text_prompt']

Expected behavior

The procedure should work the same for a dataset with one shard as for one with multiple shards.

Environment info

  • datasets version: 2.18.0
  • Platform: macOS-14.1-arm64-arm-64bit
  • Python version: 3.11.8
  • huggingface_hub version: 0.22.2
  • PyArrow version: 15.0.2
  • Pandas version: 2.2.2
  • fsspec version: 2024.2.0

Edit: I looked at the source code of load.py in datasets. I should have used `load_from_disk`, and it indeed works that way. But ideally `load_dataset` would have raised an error, the same way as when I pass a path:

    if Path(path, config.DATASET_STATE_JSON_FILENAME).exists():
        raise ValueError(
            "You are trying to load a dataset that was saved using `save_to_disk`. "
            "Please use `load_from_disk` instead."
        )

Nevertheless, I find it interesting that it works just fine, without any warning, when there are multiple shards.
