Description
Describe the bug
When a dataset saved to disk consists of a single shard, it is not loaded the same way as when it was saved in multiple shards. I installed the latest version of datasets via pip.
Steps to reproduce the bug
The code below reproduces the behavior. Everything works when the loop range is 10000, but it fails when it is 1000.
from PIL import Image
import numpy as np
from datasets import Dataset, DatasetDict, load_dataset

def load_image():
    # Generate random noise image
    noise = np.random.randint(0, 256, (256, 256, 3), dtype=np.uint8)
    return Image.fromarray(noise)

def create_dataset():
    input_images = []
    output_images = []
    text_prompts = []
    for _ in range(10000):  # this is the problematic parameter
        input_images.append(load_image())
        output_images.append(load_image())
        text_prompts.append('test prompt')
    data = {'input_image': input_images, 'output_image': output_images, 'text_prompt': text_prompts}
    dataset = Dataset.from_dict(data)
    return DatasetDict({'train': dataset})

dataset = create_dataset()
print('dataset before saving')
print(dataset)
print(dataset['train'].column_names)

dataset.save_to_disk('test_ds')

print('dataset after loading')
dataset_loaded = load_dataset('test_ds')
print(dataset_loaded)
print(dataset_loaded['train'].column_names)
The output for 1000 iterations is:
dataset before saving
DatasetDict({
    train: Dataset({
        features: ['input_image', 'output_image', 'text_prompt'],
        num_rows: 1000
    })
})
['input_image', 'output_image', 'text_prompt']
Saving the dataset (1/1 shards): 100%|█| 1000/1000 [00:00<00:00, 5156.00 example
dataset after loading
Generating train split: 1 examples [00:00, 230.52 examples/s]
DatasetDict({
    train: Dataset({
        features: ['_data_files', '_fingerprint', '_format_columns', '_format_kwargs', '_format_type', '_output_all_columns', '_split'],
        num_rows: 1
    })
})
['_data_files', '_fingerprint', '_format_columns', '_format_kwargs', '_format_type', '_output_all_columns', '_split']
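Notably, these column names look like the keys of the state.json file that save_to_disk writes next to the Arrow shard, which suggests load_dataset is picking that JSON metadata file up as data. A quick check, assuming the 'test_ds' layout produced by the repro above:

import json
import os

# Inspect what save_to_disk actually wrote for the train split
print(sorted(os.listdir('test_ds/train')))

# Hypothesis: the unexpected columns are simply the keys of state.json
with open('test_ds/train/state.json') as f:
    print(sorted(json.load(f).keys()))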
For 10000 iterations (8 shards) it is correct:
dataset before saving
DatasetDict({
    train: Dataset({
        features: ['input_image', 'output_image', 'text_prompt'],
        num_rows: 10000
    })
})
['input_image', 'output_image', 'text_prompt']
Saving the dataset (8/8 shards): 100%|█| 10000/10000 [00:01<00:00, 6237.68 examp
dataset after loading
Generating train split: 10000 examples [00:00, 10773.16 examples/s]
DatasetDict({
    train: Dataset({
        features: ['input_image', 'output_image', 'text_prompt'],
        num_rows: 10000
    })
})
['input_image', 'output_image', 'text_prompt']
Expected behavior
The procedure should work the same for a dataset with one shard as for one with multiple shards.
Environment info
- datasets version: 2.18.0
- Platform: macOS-14.1-arm64-arm-64bit
- Python version: 3.11.8
- huggingface_hub version: 0.22.2
- PyArrow version: 15.0.2
- Pandas version: 2.2.2
- fsspec version: 2024.2.0
Edit: I looked in the source code of load.py in datasets. I should have used "load_from_disk", and it indeed works that way. But ideally load_dataset would have raised an error, the same way it does when the path it is given contains the dataset state file:
if Path(path, config.DATASET_STATE_JSON_FILENAME).exists():
    raise ValueError(
        "You are trying to load a dataset that was saved using `save_to_disk`. "
        "Please use `load_from_disk` instead."
    )
Nevertheless, I find it interesting that it works just fine, and without any warning, when there are multiple shards.
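For completeness, a minimal sketch of the workaround, assuming the same 'test_ds' directory created by the repro script above:

from datasets import load_from_disk

# load_from_disk reads directories produced by save_to_disk,
# regardless of whether a split was written as one shard or several
dataset_loaded = load_from_disk('test_ds')
print(dataset_loaded)
print(dataset_loaded['train'].column_names)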