app-token-issuer-releng-renovate[bot]
This PR contains the following updates:

| Package | Change |
| --- | --- |
| datasets | `==2.21.0` -> `==4.0.0` |

Warning

Some dependencies could not be looked up. Check the Dependency Dashboard for more information.


Release Notes

huggingface/datasets (datasets)

v4.0.0

Compare Source

New Features

Build streaming data pipelines in a few lines of code!

from datasets import load_dataset

ds = load_dataset(..., streaming=True)
ds = ds.map(...).filter(...)
ds.push_to_hub(...)


* Add `num_proc=` to `.push_to_hub()` (Dataset and IterableDataset) by @lhoestq in https://github.com/huggingface/datasets/pull/7606

```python
# Faster push to Hub! Available for both Dataset and IterableDataset.
ds.push_to_hub(..., num_proc=8)
```

Syntax:

```python
ds["column_name"]  # datasets.Column([...]) or datasets.IterableColumn(...)

# Iterate on a column:
for text in ds["text"]:
    ...

# Load one cell without bringing the full column in memory:
first_text = ds["text"][0]  # equivalent to ds[0]["text"]
```

* Torchcodec decoding by @TyTodd in https://github.com/huggingface/datasets/pull/7616
  - Enables streaming only the ranges you need!

```python
# Don't download full audios/videos when it's not necessary:
# with torchcodec, only the required ranges/frames are streamed.
from datasets import load_dataset

ds = load_dataset(..., streaming=True)
for example in ds:
    video = example["video"]
    frames = video.get_frames_in_range(start=0, stop=6, step=1)  # only stream certain frames
```

  • Requires torch>=2.7.0 and FFmpeg >= 4
  • Not available for Windows yet but it is coming soon; in the meantime please use datasets<4.0
  • Load audio data with AudioDecoder:

```python
audio = dataset[0]["audio"]  # <datasets.features._torchcodec.AudioDecoder object at 0x11642b6a0>
samples = audio.get_all_samples()  # or use get_samples_played_in_range(...)
samples.data  # tensor([[ 0.0000e+00,  0.0000e+00,  0.0000e+00,  ...,  2.3447e-06, -1.9127e-04, -5.3330e-05]])
samples.sample_rate  # 16000

# The old syntax is still supported:
array, sr = audio["array"], audio["sampling_rate"]
```

  • Load video data with VideoDecoder:

```python
video = dataset[0]["video"]  # <torchcodec.decoders._video_decoder.VideoDecoder object at 0x14a61d5a0>
first_frame = video.get_frame_at(0)
first_frame.data.shape  # (3, 240, 320)
first_frame.pts_seconds  # 0.0
frames = video.get_frames_in_range(0, 6, 1)
frames.data.shape  # torch.Size([5, 3, 240, 320])
```

Breaking changes

  • Remove scripts altogether by @​lhoestq in https://github.com/huggingface/datasets/pull/7592

    • trust_remote_code is no longer supported
  • Torchcodec decoding by @​TyTodd in https://github.com/huggingface/datasets/pull/7616

    • torchcodec replaces soundfile for audio decoding
    • torchcodec replaces decord for video decoding
  • Replace Sequence by List by @​lhoestq in https://github.com/huggingface/datasets/pull/7634

    • Introduction of the List type
    from datasets import Features, List, Value
    
    features = Features({
        "texts": List(Value("string")),
        "four_paragraphs": List(Value("string"), length=4)
    })
    • Sequence was a legacy type from TensorFlow Datasets which converted lists of dicts into dicts of lists. It is no longer a type; it is now a utility that returns a List or a dict depending on the subfeature:
    from datasets import Sequence
    
    Sequence(Value("string"))  # List(Value("string"))
    Sequence({"texts": Value("string")})  # {"texts": List(Value("string"))}

Other improvements and bug fixes

New Contributors

Full Changelog: huggingface/datasets@3.6.0...4.0.0

v3.6.0

Compare Source

Dataset Features

Other improvements and bug fixes

New Contributors

Full Changelog: huggingface/datasets@3.5.1...3.6.0

v3.5.1

Compare Source

Bug fixes

Other improvements

New Contributors

Full Changelog: huggingface/datasets@3.5.0...3.5.1

v3.5.0

Compare Source

Datasets Features

>>> from datasets import load_dataset, Pdf
>>> repo = "path/to/pdf/folder"  # or username/dataset_name on Hugging Face
>>> dataset = load_dataset(repo, split="train")
>>> dataset[0]["pdf"]
<pdfplumber.pdf.PDF at 0x1075bc320>
>>> dataset[0]["pdf"].pages[0].extract_text()
...

What's Changed

New Contributors

Full Changelog: huggingface/datasets@3.4.1...3.5.0

v3.4.1

Compare Source

Bug Fixes

Full Changelog: huggingface/datasets@3.4.0...3.4.1

v3.4.0

Compare Source

Dataset Features

  • Faster folder based builder + parquet support + allow repeated media + use torchvision by @lhoestq in https://github.com/huggingface/datasets/pull/7424

    • /!\ Breaking change: we replaced decord with torchvision to read videos, since decord is no longer maintained and isn't available for recent Python versions; see the video dataset loading documentation for more details. The Video type is still marked as experimental in this version.
    from datasets import load_dataset, Video
    
    dataset = load_dataset("path/to/video/folder", split="train")
    dataset[0]["video"]  # <torchvision.io.video_reader.VideoReader at 0x1652284c0>
    • faster streaming for image/audio/video folders from Hugging Face
    • support for metadata.parquet in addition to metadata.csv or metadata.jsonl for the metadata of the image/audio/video files
  • Add IterableDataset.decode with multithreading by @lhoestq in https://github.com/huggingface/datasets/pull/7450

    • even faster streaming for image/audio/video folders from Hugging Face if you enable multithreading to decode image/audio/video data:
    dataset = dataset.decode(num_threads=num_threads)
  • Add with_split to DatasetDict.map by @​jp1924 in https://github.com/huggingface/datasets/pull/7368
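
As a sketch of the metadata.parquet support mentioned above: the metadata file sits next to the media files and references each one through a `file_name` column. The folder and file names below are hypothetical.

```
my_image_folder/
├── metadata.parquet   # must contain a "file_name" column, plus any metadata columns (caption, label, ...)
├── img0001.png
└── img0002.png
```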

General improvements and bug fixes

New Contributors

Full Changelog: huggingface/datasets@3.3.2...3.4.0

v3.3.2

Compare Source

Bug fixes

Other general improvements

New Contributors

Full Changelog: huggingface/datasets@3.3.1...3.3.2

v3.3.1

Compare Source

Bug fixes

Full Changelog: huggingface/datasets@3.3.0...3.3.1

v3.3.0

Compare Source

Dataset Features

  • Support async functions in map() by @​lhoestq in https://github.com/huggingface/datasets/pull/7384

    • Especially useful to download content like images or call inference APIs
    prompt = "Answer the following question: {question}. You should think step by step."
    
    async def ask_llm(example):
        # query_model is a placeholder for your own async inference call
        return await query_model(prompt.format(question=example["question"]))
    
    ds = ds.map(ask_llm)
  • Add repeat method to datasets by @​alex-hh in https://github.com/huggingface/datasets/pull/7198

    ds = ds.repeat(10)
  • Support faster processing using pandas or polars functions in IterableDataset.map() by @​lhoestq in https://github.com/huggingface/datasets/pull/7370

    • Add support for "pandas" and "polars" formats in IterableDatasets
    • This enables optimized data processing using pandas or polars functions with zero-copy, e.g.
    import polars as pl
    from datasets import load_dataset
    
    ds = load_dataset("ServiceNow-AI/R1-Distill-SFT", "v0", split="train", streaming=True)
    ds = ds.with_format("polars")
    expr = pl.col("solution").str.extract("boxed\\{(.*)\\}").alias("value_solution")
    ds = ds.map(lambda df: df.with_columns(expr), batched=True)
  • Apply formatting after iter_arrow to speed up format -> map, filter for iterable datasets by @​alex-hh in https://github.com/huggingface/datasets/pull/7207

    • IterableDatasets with "numpy" format are now much faster

What's Changed

New Contributors

Full Changelog: huggingface/datasets@3.2.0...3.3.0

v3.2.0

Compare Source

Dataset Features

  • Faster parquet streaming + filters with predicate pushdown by @​lhoestq in https://github.com/huggingface/datasets/pull/7309
    • Up to +100% streaming speed
    • Fast filtering via predicate pushdown (skip files/row groups based on predicate instead of downloading the full data), e.g.
      from datasets import load_dataset
      filters = [('date', '>=', '2023')]
      ds = load_dataset("HuggingFaceFW/fineweb-2", "fra_Latn", streaming=True, filters=filters)

Other improvements and bug fixes

New Contributors

Full Changelog: huggingface/datasets@3.1.0...3.2.0

v3.1.0

Compare Source

Dataset Features

What's Changed

New Contributors

Full Changelog: huggingface/datasets@3.0.2...3.1.0

v3.0.2

Compare Source

Main bug fixes

What's Changed

New Contributors

Full Changelog: huggingface/datasets@3.0.1...3.0.2

v3.0.1

Compare Source

What's Changed

New Contributors

Full Changelog: huggingface/datasets@3.0.0...3.0.1

v3.0.0

Compare Source

Dataset Features

  • Use Polars functions in .map()
    • Allow Polars as valid output type by @​psmyth94 in https://github.com/huggingface/datasets/pull/6762

    • Example:

      >>> import polars as pl
      >>> from datasets import load_dataset
      >>> ds = load_dataset("lhoestq/CudyPokemonAdventures", split="train").with_format("polars")
      >>> cols = [pl.col("content").str.len_bytes().alias("length")]
      >>> ds_with_length = ds.map(lambda df: df.with_columns(cols), batched=True)
      >>> ds_with_length[:5]
      shape: (5, 5)
      ┌─────┬───────────────────────────────────┬───────────────────────────────────┬───────────────────────┬────────┐
      │ idx ┆ title                             ┆ content                           ┆ labels                ┆ length │
      │ --- ┆ ---                               ┆ ---                               ┆ ---                   ┆ ---    │
      │ i64 ┆ str                               ┆ str                               ┆ str                   ┆ u32    │
      ╞═════╪═══════════════════════════════════╪═══════════════════════════════════╪═══════════════════════╪════════╡
      │ 0   ┆ The Joyful Adventure of Bulbasau… ┆ Bulbasaur embarked on a sunny qu… ┆ joyful_adventure      ┆ 180    │
      │ 1   ┆ Pikachu's Quest for Peace         ┆ Pikachu, with his cheeky persona… ┆ peaceful_narrative    ┆ 138    │
      │ 2   ┆ The Tender Tale of Squirtle       ┆ Squirtle took everyone on a memo… ┆ gentle_adventure      ┆ 135    │
      │ 3   ┆ Charizard's Heartwarming Tale     ┆ Charizard found joy in helping o… ┆ heartwarming_story    ┆ 112    │
      │ 4   ┆ Jolteon's Sparkling Journey       ┆ Jolteon, with his zest for life,… ┆ celebratory_narrative ┆ 111    │
      └─────┴───────────────────────────────────┴───────────────────────────────────┴───────────────────────┴────────┘
  • Support NumPy 2

Cache Changes

  • Use huggingface_hub cache by @​lhoestq in https://github.com/huggingface/datasets/pull/7105
    • use the huggingface_hub cache for files downloaded from HF, by default at ~/.cache/huggingface/hub
    • cached datasets (Arrow files) will still be reloaded from the datasets cache, by default at ~/.cache/huggingface/datasets
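
Both default locations can be redirected with environment variables; a minimal sketch (the /data/hf paths are placeholders):

```shell
# Move the huggingface_hub download cache (files downloaded from HF)
export HF_HUB_CACHE=/data/hf/hub
# Move the datasets Arrow cache (processed/cached datasets)
export HF_DATASETS_CACHE=/data/hf/datasets
# Or move everything under one root
export HF_HOME=/data/hf
```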

Breaking changes

General improvements and bug fixes

New Contributors

Full Changelog: huggingface/datasets@2.21.0...3.0.0


Configuration

📅 Schedule: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined).

🚦 Automerge: Disabled by config. Please merge this manually once you are satisfied.

Rebasing: Whenever PR is behind base branch, or you tick the rebase/retry checkbox.

🔕 Ignore: Close this PR and you won't be reminded about this update again.


  • If you want to rebase/retry this PR, check this box

This PR has been generated by Renovate Bot.

@app-token-issuer-releng-renovate
Edited/Blocked Notification

Renovate will not automatically rebase this PR, because it does not recognize the last commit author and assumes somebody else may have edited the PR.

You can manually request rebase by checking the rebase/retry box above.

⚠️ Warning: custom changes will be lost.

This PR is stale because it has been open 30 days with no activity.
Remove the stale label or comment or this will be closed in 7 days.

@github-actions github-actions bot added the Stale Stale PRs will be closed after 7 days unless the label is removed or new updates are made. label Aug 16, 2025

This PR has been automatically closed because it has been stale for > 30 days.
If you wish to continue working on this PR, please reopen it and make any necessary changes.

@github-actions github-actions bot closed this Aug 24, 2025
@github-actions github-actions bot deleted the renovate/datasets-4.x branch August 24, 2025 00:47
@app-token-issuer-releng-renovate
Copy link
Contributor Author

Renovate Ignore Notification

Because you closed this PR without merging, Renovate will ignore this update. You will not get PRs for any future 4.x releases. But if you manually upgrade to 4.x then Renovate will re-enable minor and patch updates automatically.

If you accidentally closed this PR, or if you changed your mind: rename this PR to get a fresh replacement PR.


Labels

renovate, Stale

2 participants