Is it possible to only download a fraction size of the dataset? #1896
Replies: 2 comments 2 replies
-
Hi! Currently, the |
-
Yes, streaming mode is the key: it lets you process a fraction of a dataset without downloading the full thing.
from datasets import load_dataset
# Option 1: Streaming (best for large datasets)
dataset = load_dataset("common_crawl", split="train", streaming=True)
# Take only first N examples without downloading everything
sample = dataset.take(10000)
for example in sample:
    process(example)  # process() stands in for your own per-example logic
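# Option 2: Split with percentage syntax
# Note: without streaming, slicing a split still downloads and prepares the
# full dataset; only the rows loaded are reduced, so this does not save disk space.
dataset = load_dataset("openwebtext", split="train[:5%]")
# Option 3: Slice by index range (same caveat as Option 2)
dataset = load_dataset("pile", split="train[:50000]")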
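# Option 4: Download a specific shard
# (the file name must match a path that exists in the dataset repository;
# split="train" loads every row of that one shard, whereas "train[0:1]" would load a single row)
dataset = load_dataset("c4", "en", split="train",
                       data_files={"train": "c4-train.00001-of-01024.json.gz"})

Streaming vs. downloading a fraction: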
For exploration/prototyping, streaming is usually the right choice:
# Streaming with filter and map
dataset = load_dataset("bookcorpus", split="train", streaming=True)
filtered = dataset.filter(lambda x: len(x["text"]) > 500)
# tokenizer is assumed to be a tokenizer object you have already loaded
processed = filtered.map(lambda x: tokenizer(x["text"]))
small_sample = processed.take(1000)
# Convert to regular dataset if you need random access
from datasets import Dataset
small_ds = Dataset.from_list(list(small_sample))
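Applied to the original question (Natural Questions in Colab with limited disk), the streaming approach might look like the sketch below. Treat it as an illustration under assumptions rather than a confirmed recipe: the dataset id "natural_questions" is a guess at what the asker is loading (the thread does not give the exact id), and first_n and stream_natural_questions are hypothetical helper names. The datasets import is deferred so the generic helper can be tried offline on any iterable.

```python
from itertools import islice
from typing import Any, Iterable, List


def first_n(stream: Iterable[Any], n: int) -> List[Any]:
    """Materialize only the first n items of a (possibly huge) iterable.

    islice stops pulling from the stream after n items, so the rest of
    the stream is never read.
    """
    if n < 0:
        raise ValueError("n must be non-negative")
    return list(islice(stream, n))


def stream_natural_questions(n: int) -> List[Any]:
    """Hypothetical helper: fetch the first n training examples by streaming.

    Assumes the dataset id "natural_questions" and network access.
    """
    # Deferred import: only needed when actually contacting the Hub.
    from datasets import load_dataset

    stream = load_dataset("natural_questions", split="train", streaming=True)
    return first_n(stream, n)


if __name__ == "__main__":
    # Offline demonstration with a stand-in generator instead of a real dataset.
    def fake_records():
        i = 0
        while True:  # endless stream: could never be read eagerly
            yield {"id": i, "text": f"record {i}"}
            i += 1

    sample = first_n(fake_records(), 3)
    print([r["id"] for r in sample])  # prints [0, 1, 2]
```

In a notebook, calling stream_natural_questions(1000) would keep just 1000 examples in memory. With streaming, the library fetches data as it iterates instead of writing the full dataset to disk first, which is what makes it a fit for a disk-constrained environment.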
-
I want to download the Natural Questions dataset in a Google Colab notebook. However, when I run the standard download function, it fails because I lack disk space. Is it possible to download the dataset in fractions?