Description
Since the dataset publishes no default config and I would like to iterate over diverse data, I tried streaming every config and interleaving them:
from datasets import get_dataset_config_names, load_dataset, interleave_datasets

configs = get_dataset_config_names("HuggingFaceFW/fineweb-2")
print(configs)
streams = [
    load_dataset("HuggingFaceFW/fineweb-2", c, split="train", streaming=True)
    for c in configs
]
# Option A: round-robin (equal mixing across languages)
ds = interleave_datasets(streams, seed=42)
# ds is now an IterableDataset; languages are naturally mixed as you iterate.
for ex in ds.take(3):
    print(ex.keys())

This prints all configs (['aai_Latn', 'aak_Latn', 'aau_Latn', 'aaz_Latn',....)
and then:
ValueError: At least one valid data file must be specified, all the data_files are invalid: {'test': [], 'train': ['hf://datasets/HuggingFaceFW/fineweb-2@af9c13333eb981300149d5ca60a8e9d659b276b9/data/abi_Latn/train/000_00000.parquet']}
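As a possible stopgap, a sketch like the one below could skip configs that fail to load instead of aborting the whole list comprehension. It assumes the failure surfaces as a ValueError at load time, as in the error above. The helper name load_valid_streams and its loader parameter are hypothetical, not part of the datasets library.

```python
from typing import Callable, Iterable, List, Tuple


def load_valid_streams(
    configs: Iterable[str],
    loader: Callable[[str], object],
) -> Tuple[List[object], List[str]]:
    """Hypothetical helper: call loader on each config, keep the ones that load."""
    streams: List[object] = []
    skipped: List[str] = []
    for name in configs:
        try:
            streams.append(loader(name))
        except ValueError:
            # Assumption: unresolved data files raise ValueError, as in the error above.
            skipped.append(name)
    return streams, skipped
```

With datasets, the loader could be something like lambda c: load_dataset("HuggingFaceFW/fineweb-2", c, split="train", streaming=True), and the returned streams could then be passed to interleave_datasets. This only hides the symptom; the skipped list shows which configs are affected.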
Minimally:
from datasets import load_dataset
ds = load_dataset("HuggingFaceFW/fineweb-2", "abi_Latn", split="train", streaming=True)
ds.take(1)

This works on my Mac but fails on my server.
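Since the same call behaves differently on the two machines, comparing installed package versions seems like a reasonable first step. The sketch below prints the versions of a few packages that might be relevant. The package list is my guess, not a confirmed cause, and collect_versions is a hypothetical helper name.

```python
import platform
from importlib import metadata


def collect_versions(packages=("datasets", "huggingface_hub", "fsspec", "pyarrow")):
    """Return a dict mapping each package name to its installed version string."""
    info = {"python": platform.python_version()}
    for pkg in packages:
        try:
            info[pkg] = metadata.version(pkg)
        except metadata.PackageNotFoundError:
            # Record missing packages rather than failing, so both machines report the same keys.
            info[pkg] = "not installed"
    return info


if __name__ == "__main__":
    for name, version in collect_versions().items():
        print(f"{name}: {version}")
```

Running this on both the Mac and the server and diffing the output would show whether the environments differ.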