app-token-issuer-releng-renovate[bot]
This PR contains the following updates:

| Package | Change |
| --- | --- |
| datasets | `==2.21.0` -> `==4.0.0` |

Warning

Some dependencies could not be looked up. Check the Dependency Dashboard for more information.


Release Notes

huggingface/datasets (datasets)

v4.0.0

Compare Source

New Features

Build streaming data pipelines in a few lines of code!

from datasets import load_dataset

ds = load_dataset(..., streaming=True)
ds = ds.map(...).filter(...)
ds.push_to_hub(...)


* Add `num_proc=` to `.push_to_hub()` (Dataset and IterableDataset) by @lhoestq in https://github.com/huggingface/datasets/pull/7606

```python
# Faster push to Hub! Available for both Dataset and IterableDataset.
ds.push_to_hub(..., num_proc=8)
```

Syntax:

```python
ds["column_name"]  # datasets.Column([...]) or datasets.IterableColumn(...)

# Iterate on a column:
for text in ds["text"]:
    ...

# Load one cell without bringing the full column in memory:
first_text = ds["text"][0]  # equivalent to ds[0]["text"]
```

* Torchcodec decoding by @TyTodd in https://github.com/huggingface/datasets/pull/7616
  - Enables streaming only the ranges you need!

```python
# Don't download full audios/videos when it's not necessary:
# with torchcodec, only the required ranges/frames are streamed.
from datasets import load_dataset

ds = load_dataset(..., streaming=True)
for example in ds:
    video = example["video"]
    frames = video.get_frames_in_range(start=0, stop=6, step=1)  # only stream certain frames
```

  • Requires torch>=2.7.0 and FFmpeg >= 4
  • Not available for Windows yet but it is coming soon; in the meantime please use datasets<4.0
  • Load audio data with AudioDecoder:

```python
audio = dataset[0]["audio"]  # <datasets.features._torchcodec.AudioDecoder object at 0x11642b6a0>
samples = audio.get_all_samples()  # or use get_samples_played_in_range(...)
samples.data  # tensor([[ 0.0000e+00,  0.0000e+00,  0.0000e+00,  ...,  2.3447e-06, -1.9127e-04, -5.3330e-05]])
samples.sample_rate  # 16000

# The old syntax is still supported:
array, sr = audio["array"], audio["sampling_rate"]
```

  • Load video data with VideoDecoder:

```python
video = dataset[0]["video"]  # <torchcodec.decoders._video_decoder.VideoDecoder object at 0x14a61d5a0>
first_frame = video.get_frame_at(0)
first_frame.data.shape  # (3, 240, 320)
first_frame.pts_seconds  # 0.0
frames = video.get_frames_in_range(0, 6, 1)
frames.data.shape  # torch.Size([5, 3, 240, 320])
```

Breaking changes

  • Remove scripts altogether by @​lhoestq in https://github.com/huggingface/datasets/pull/7592

    • trust_remote_code is no longer supported
  • Torchcodec decoding by @​TyTodd in https://github.com/huggingface/datasets/pull/7616

    • torchcodec replaces soundfile for audio decoding
    • torchcodec replaces decord for video decoding
  • Replace Sequence by List by @​lhoestq in https://github.com/huggingface/datasets/pull/7634

    • Introduction of the List type
    from datasets import Features, List, Value
    
    features = Features({
        "texts": List(Value("string")),
        "four_paragraphs": List(Value("string"), length=4)
    })
    • Sequence was a legacy type from TensorFlow Datasets which converted lists of dicts into dicts of lists. It is no longer a type; it is now a utility that returns a List or a dict depending on the subfeature:
    from datasets import Sequence
    
    Sequence(Value("string"))  # List(Value("string"))
    Sequence({"texts": Value("string")})  # {"texts": List(Value("string"))}

Other improvements and bug fixes

New Contributors

Full Changelog: huggingface/datasets@3.6.0...4.0.0

v3.6.0

Compare Source

Dataset Features

Other improvements and bug fixes

New Contributors

Full Changelog: huggingface/datasets@3.5.1...3.6.0

v3.5.1

Compare Source

Bug fixes

Other improvements

New Contributors

Full Changelog: huggingface/datasets@3.5.0...3.5.1

v3.5.0

Compare Source

Datasets Features

>>> from datasets import load_dataset, Pdf
>>> repo = "path/to/pdf/folder"  # or username/dataset_name on Hugging Face
>>> dataset = load_dataset(repo, split="train")
>>> dataset[0]["pdf"]
<pdfplumber.pdf.PDF at 0x1075bc320>
>>> dataset[0]["pdf"].pages[0].extract_text()
...

What's Changed

New Contributors

Full Changelog: huggingface/datasets@3.4.1...3.5.0

v3.4.1

Compare Source

Bug Fixes

Full Changelog: huggingface/datasets@3.4.0...3.4.1

v3.4.0

Compare Source

Dataset Features

  • Faster folder based builder + parquet support + allow repeated media + use torchvision by @lhoestq in https://github.com/huggingface/datasets/pull/7424

    • /!\ Breaking change: we replaced decord with torchvision to read videos, since decord is no longer maintained and isn't available for recent Python versions; see the video dataset loading documentation for more details. The Video type is still marked as experimental in this version.
    from datasets import load_dataset, Video
    
    dataset = load_dataset("path/to/video/folder", split="train")
    dataset[0]["video"]  # <torchvision.io.video_reader.VideoReader at 0x1652284c0>
    • faster streaming for image/audio/video folders from Hugging Face
    • support for metadata.parquet in addition to metadata.csv or metadata.jsonl for the metadata of the image/audio/video files
  • Add IterableDataset.decode with multithreading by @lhoestq in https://github.com/huggingface/datasets/pull/7450

    • even faster streaming for image/audio/video folders from Hugging Face if you enable multithreading to decode image/audio/video data:
    dataset = dataset.decode(num_threads=num_threads)
  • Add with_split to DatasetDict.map by @​jp1924 in https://github.com/huggingface/datasets/pull/7368
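
As a sketch of the metadata.parquet support mentioned above: the metadata file sits next to the media files and references each one through a `file_name` column. The folder and file names below are hypothetical.

```
my_image_folder/
├── metadata.parquet   # must contain a "file_name" column, plus any metadata columns (caption, label, ...)
├── img0001.png
└── img0002.png
```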

General improvements and bug fixes

New Contributors

Full Changelog: huggingface/datasets@3.3.2...3.4.0

v3.3.2

Compare Source

Bug fixes

Other general improvements

New Contributors

Full Changelog: huggingface/datasets@3.3.1...3.3.2

v3.3.1

Compare Source

Bug fixes

Full Changelog: huggingface/datasets@3.3.0...3.3.1

v3.3.0

Compare Source

Dataset Features

  • Support async functions in map() by @​lhoestq in https://github.com/huggingface/datasets/pull/7384

    • Especially useful to download content like images or call inference APIs
    prompt = "Answer the following question: {question}. You should think step by step."
    
    async def ask_llm(example):
        # query_model is a placeholder for your own async inference call
        return await query_model(prompt.format(question=example["question"]))
    
    ds = ds.map(ask_llm)
  • Add repeat method to datasets by @​alex-hh in https://github.com/huggingface/datasets/pull/7198

    ds = ds.repeat(10)
  • Support faster processing using pandas or polars functions in IterableDataset.map() by @​lhoestq in https://github.com/huggingface/datasets/pull/7370

    • Add support for "pandas" and "polars" formats in IterableDatasets
    • This enables optimized data processing using pandas or polars functions with zero-copy, e.g.
    import polars as pl
    from datasets import load_dataset
    
    ds = load_dataset("ServiceNow-AI/R1-Distill-SFT", "v0", split="train", streaming=True)
    ds = ds.with_format("polars")
    expr = pl.col("solution").str.extract("boxed\\{(.*)\\}").alias("value_solution")
    ds = ds.map(lambda df: df.with_columns(expr), batched=True)
  • Apply formatting after iter_arrow to speed up format -> map, filter for iterable datasets by @​alex-hh in https://github.com/huggingface/datasets/pull/7207

    • IterableDatasets with "numpy" format are now much faster

What's Changed

New Contributors

Full Changelog: huggingface/datasets@3.2.0...3.3.0

v3.2.0

Compare Source

Dataset Features

  • Faster parquet streaming + filters with predicate pushdown by @​lhoestq in https://github.com/huggingface/datasets/pull/7309
    • Up to +100% streaming speed
    • Fast filtering via predicate pushdown (skip files/row groups based on predicate instead of downloading the full data), e.g.
      from datasets import load_dataset
      filters = [('date', '>=', '2023')]
      ds = load_dataset("HuggingFaceFW/fineweb-2", "fra_Latn", streaming=True, filters=filters)

Other improvements and bug fixes

New Contributors

Full Changelog: huggingface/datasets@3.1.0...3.2.0

v3.1.0

Compare Source

Dataset Features

What's Changed

New Contributors

Full Changelog: huggingface/datasets@3.0.2...3.1.0

v3.0.2

Compare Source

Main bug fixes

What's Changed

New Contributors

Full Changelog: huggingface/datasets@3.0.1...3.0.2

v3.0.1

Compare Source

What's Changed

New Contributors

Full Changelog: huggingface/datasets@3.0.0...3.0.1

v3.0.0

Compare Source

Dataset Features

  • Use Polars functions in .map()
    • Allow Polars as valid output type by @​psmyth94 in https://github.com/huggingface/datasets/pull/6762

    • Example:

      >>> import polars as pl
      >>> from datasets import load_dataset
      >>> ds = load_dataset("lhoestq/CudyPokemonAdventures", split="train").with_format("polars")
      >>> cols = [pl.col("content").str.len_bytes().alias("length")]
      >>> ds_with_length = ds.map(lambda df: df.with_columns(cols), batched=True)
      >>> ds_with_length[:5]
      shape: (5, 5)
      ┌─────┬───────────────────────────────────┬───────────────────────────────────┬───────────────────────┬────────┐
      │ idx ┆ title                             ┆ content                           ┆ labels                ┆ length │
      │ --- ┆ ---                               ┆ ---                               ┆ ---                   ┆ ---    │
      │ i64 ┆ str                               ┆ str                               ┆ str                   ┆ u32    │
      ╞═════╪═══════════════════════════════════╪═══════════════════════════════════╪═══════════════════════╪════════╡
      │ 0   ┆ The Joyful Adventure of Bulbasau… ┆ Bulbasaur embarked on a sunny qu… ┆ joyful_adventure      ┆ 180    │
      │ 1   ┆ Pikachu's Quest for Peace         ┆ Pikachu, with his cheeky persona… ┆ peaceful_narrative    ┆ 138    │
      │ 2   ┆ The Tender Tale of Squirtle       ┆ Squirtle took everyone on a memo… ┆ gentle_adventure      ┆ 135    │
      │ 3   ┆ Charizard's Heartwarming Tale     ┆ Charizard found joy in helping o… ┆ heartwarming_story    ┆ 112    │
      │ 4   ┆ Jolteon's Sparkling Journey       ┆ Jolteon, with his zest for life,… ┆ celebratory_narrative ┆ 111    │
      └─────┴───────────────────────────────────┴───────────────────────────────────┴───────────────────────┴────────┘
  • Support NumPy 2

Cache Changes

  • Use huggingface_hub cache by @​lhoestq in https://github.com/huggingface/datasets/pull/7105
    • use the huggingface_hub cache for files downloaded from HF, by default at ~/.cache/huggingface/hub
    • cached datasets (Arrow files) will still be reloaded from the datasets cache, by default at ~/.cache/huggingface/datasets
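
Both default locations can be redirected with environment variables; a minimal sketch (the /data/hf paths are placeholders):

```shell
# Move the huggingface_hub download cache (files downloaded from HF)
export HF_HUB_CACHE=/data/hf/hub
# Move the datasets Arrow cache (processed/cached datasets)
export HF_DATASETS_CACHE=/data/hf/datasets
# Or move everything under one root
export HF_HOME=/data/hf
```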

Breaking changes

General improvements and bug fixes

New Contributors

Full Changelog: huggingface/datasets@2.21.0...3.0.0


Configuration

📅 Schedule: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined).

🚦 Automerge: Disabled by config. Please merge this manually once you are satisfied.

Rebasing: Whenever PR is behind base branch, or you tick the rebase/retry checkbox.

🔕 Ignore: Close this PR and you won't be reminded about this update again.


  • If you want to rebase/retry this PR, check this box

This PR has been generated by Renovate Bot.

@app-token-issuer-releng-renovate
Edited/Blocked Notification

Renovate will not automatically rebase this PR, because it does not recognize the last commit author and assumes somebody else may have edited the PR.

You can manually request rebase by checking the rebase/retry box above.

⚠️ Warning: custom changes will be lost.

This PR is stale because it has been open 30 days with no activity.
Remove the stale label or comment or this will be closed in 7 days.

@github-actions github-actions bot added the Stale Stale PRs will be closed after 7 days unless the label is removed or new updates are made. label Aug 16, 2025

This PR has been automatically closed because it has been stale for > 30 days.
If you wish to continue working on this PR, please reopen it and make any necessary changes.

@github-actions github-actions bot closed this Aug 24, 2025
@github-actions github-actions bot deleted the renovate/datasets-4.x branch August 24, 2025 00:47
@app-token-issuer-releng-renovate
Copy link
Contributor Author

Renovate Ignore Notification

Because you closed this PR without merging, Renovate will ignore this update. You will not get PRs for any future 4.x releases. But if you manually upgrade to 4.x then Renovate will re-enable minor and patch updates automatically.

If you accidentally closed this PR, or if you changed your mind: rename this PR to get a fresh replacement PR.


Labels

renovate, Stale

2 participants