Fails to load dataset #9

@AmitMY

Since no default dataset config is published, and I would like to iterate on diverse data, I tried:

from datasets import get_dataset_config_names, load_dataset, interleave_datasets

configs = get_dataset_config_names("HuggingFaceFW/fineweb-2")
print(configs)

streams = [
    load_dataset("HuggingFaceFW/fineweb-2", c, split="train", streaming=True)
    for c in configs
]

# Round-robin interleaving (equal mixing across languages).
# Note: seed only has an effect when probabilities are passed.
ds = interleave_datasets(streams, seed=42)

# ds is now an IterableDataset; languages are naturally mixed as you iterate.
for ex in ds.take(3):
    print(ex.keys())

This prints all the configs (['aai_Latn', 'aak_Latn', 'aau_Latn', 'aaz_Latn', ...]) and then raises:

ValueError: At least one valid data file must be specified, all the data_files are invalid: {'test': [], 'train': ['hf://datasets/HuggingFaceFW/fineweb-2@af9c13333eb981300149d5ca60a8e9d659b276b9/data/abi_Latn/train/000_00000.parquet']}
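To find out whether only abi_Latn is affected or whether other configs fail as well, each config can be tried on its own and the errors collected. The following is a minimal sketch, not something taken from the library: find_failing_configs and hub_loader are hypothetical helper names, and it assumes the failure surfaces as a ValueError at load_dataset time, as in the traceback above.

```python
from typing import Callable, Iterable


def find_failing_configs(configs: Iterable[str], loader: Callable[[str], object]) -> dict:
    """Call loader(name) for each config; return {config name: error message} for failures."""
    failures = {}
    for name in configs:
        try:
            loader(name)
        except ValueError as err:  # the error type seen in this report (assumption: others are not expected)
            failures[name] = str(err)
    return failures


def hub_loader(name: str):
    """Hypothetical loader that mirrors the call from the report."""
    # Imported lazily so find_failing_configs can be exercised without `datasets` installed.
    from datasets import load_dataset

    return load_dataset("HuggingFaceFW/fineweb-2", name, split="train", streaming=True)


# Example usage (needs network access and the `datasets` package):
#   from datasets import get_dataset_config_names
#   configs = get_dataset_config_names("HuggingFaceFW/fineweb-2")
#   for name, message in find_failing_configs(configs, hub_loader).items():
#       print(name, "->", message[:120])
```

Passing the loader in as a parameter keeps the loop independent of the Hub, so the same helper can be run with a different repository or a stub.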

Minimally:

from datasets import load_dataset

ds = load_dataset("HuggingFaceFW/fineweb-2", "abi_Latn", split="train", streaming=True)
ds.take(1)

This works on my Mac but fails on my server.
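Since the same code behaves differently on the two machines, comparing the installed package versions is a reasonable first step. The sketch below prints a small environment report to run on both machines and diff. The package list is an assumption (datasets and the dependencies it uses for Hub access and Parquet reading), and environment_report is a hypothetical helper name.

```python
import platform
from importlib import metadata

# Assumed to be the relevant packages; adjust as needed.
PACKAGES = ("datasets", "huggingface_hub", "fsspec", "pyarrow")


def environment_report(packages=PACKAGES) -> dict:
    """Return Python/platform info plus the installed version of each package."""
    report = {
        "python": platform.python_version(),
        "platform": platform.platform(),
    }
    for pkg in packages:
        try:
            report[pkg] = metadata.version(pkg)
        except metadata.PackageNotFoundError:
            report[pkg] = "not installed"
    return report


for key, value in environment_report().items():
    print(f"{key}: {value}")
```

If the versions match, differences in network access or in the local Hugging Face cache on the server would be the next things to check.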
