chore(deps): update dependency datasets to v4 #1138
Closed
This PR contains the following updates:
datasets: ==2.21.0 -> ==4.0.0
Warning
Some dependencies could not be looked up. Check the Dependency Dashboard for more information.
Release Notes
huggingface/datasets (datasets)
v4.0.0
Compare Source
New Features
Add IterableDataset.push_to_hub() by @lhoestq in https://github.com/huggingface/datasets/pull/7595
Build streaming data pipelines in a few lines of code!
from datasets import load_dataset
ds = load_dataset(..., streaming=True)
ds = ds.map(...).filter(...)
ds.push_to_hub(...)
New Column object
Syntax:
ds["column_name"] # datasets.Column([...]) or datasets.IterableColumn(...)
Iterate on a column:
for text in ds["text"]:
    ...
Load one cell without bringing the full column in memory
first_text = ds["text"][0] # equivalent to ds[0]["text"]
torch>=2.7.0 and FFmpeg >= 4
datasets<4.0
AudioDecoder:
VideoDecoder:
Breaking changes
Remove scripts altogether by @lhoestq in https://github.com/huggingface/datasets/pull/7592
trust_remote_code is no longer supported
Torchcodec decoding by @TyTodd in https://github.com/huggingface/datasets/pull/7616
Replace Sequence by List by @lhoestq in https://github.com/huggingface/datasets/pull/7634
List type
Sequence was a legacy type from TensorFlow Datasets that converted a list of dicts to a dict of lists. It is no longer a type; it is now a utility that returns a List or a dict depending on the subfeature
Other improvements and bug fixes
Dataset.map to reuse cache files mapped with different num_proc by @ringohoffman in https://github.com/huggingface/datasets/pull/7434
RepeatExamplesIterable by @SilvanCodes in https://github.com/huggingface/datasets/pull/7581
_dill.py to use co_linetable for Python 3.10+ in place of co_lnotab by @qgallouedec in https://github.com/huggingface/datasets/pull/7609
New Contributors
Full Changelog: huggingface/datasets@3.6.0...4.0.0
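The "Replace Sequence by List" breaking change above says the legacy Sequence type converted a list of dicts to a dict of lists. As a rough illustration of that transposition only, here is a plain-Python sketch. The helper name list_of_dicts_to_dict_of_lists is hypothetical and is not part of datasets; the library's own conversion logic may differ.

```python
from typing import Any, Dict, List


def list_of_dicts_to_dict_of_lists(rows: List[Dict[str, Any]]) -> Dict[str, List[Any]]:
    """Transpose a list of dicts into a dict of lists.

    Hypothetical helper for illustration; keys are taken from the first row
    and every row is assumed to have the same keys.
    """
    if not rows:
        return {}
    return {key: [row[key] for row in rows] for key in rows[0]}


# A nested field holding a list of dicts...
rows = [{"text": "a", "label": 0}, {"text": "b", "label": 1}]

# ...is what the legacy behaviour would expose as a dict of lists.
columns = list_of_dicts_to_dict_of_lists(rows)
print(columns)  # {'text': ['a', 'b'], 'label': [0, 1]}
```

Code that relied on the transposed shape when reading nested fields is the likely place to check when moving from ==2.21.0 to ==4.0.0.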
v3.6.0
Compare Source
Dataset Features
Other improvements and bug fixes
aiohttp from direct dependencies by @akx in https://github.com/huggingface/datasets/pull/7294
New Contributors
Full Changelog: huggingface/datasets@3.5.1...3.6.0
v3.5.1
Compare Source
Bug fixes
TypeError: ArrayExtensionArray.to_pylist() got an unexpected keyword argument 'maps_as_pydicts'
Other improvements
New Contributors
Full Changelog: huggingface/datasets@3.5.0...3.5.1
v3.5.0
Compare Source
Datasets Features
What's Changed
New Contributors
Full Changelog: huggingface/datasets@3.4.1...3.5.0
v3.4.1
Compare Source
Bug Fixes
Full Changelog: huggingface/datasets@3.4.0...3.4.1
v3.4.0
Compare Source
Dataset Features
Faster folder based builder + parquet support + allow repeated media + use torchvideo by @lhoestq in https://github.com/huggingface/datasets/pull/7424
decord with torchvision to read videos, since decord is not maintained anymore and isn't available for recent Python versions; see the video dataset loading documentation for more details. The Video type is still marked as experimental in this version
metadata.parquet in addition to metadata.csv or metadata.jsonl for the metadata of the image/audio/video files
Add IterableDataset.decode with multithreading by @lhoestq in https://github.com/huggingface/datasets/pull/7450
Add with_split to DatasetDict.map by @jp1924 in https://github.com/huggingface/datasets/pull/7368
General improvements and bug fixes
string_to_dict to return None if there is no match instead of raising ValueError by @ringohoffman in https://github.com/huggingface/datasets/pull/7435
ds.set_epoch(new_epoch) by @lhoestq in https://github.com/huggingface/datasets/pull/7451
New Contributors
Full Changelog: huggingface/datasets@3.3.2...3.4.0
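The v3.4.0 note above says string_to_dict now returns None when there is no match instead of raising ValueError. The sketch below is a simplified stand-in written for this PR discussion, not the library's implementation. The function name string_to_dict_sketch and the "{name}" placeholder syntax are assumptions made for illustration. It shows the calling pattern that the change implies: check for None rather than catch an exception.

```python
import re
from typing import Dict, Optional


def string_to_dict_sketch(string: str, pattern: str) -> Optional[Dict[str, str]]:
    """Extract "{name}" placeholders from `string`; return None if nothing matches.

    Hypothetical stand-in for illustration only.
    """
    # Placeholder names, in order of appearance.
    keys = re.findall(r"{(.+?)}", pattern)
    # Escape the literal parts and put one capture group between them.
    literal_parts = re.split(r"{.+?}", pattern)
    regex = "(.+)".join(re.escape(part) for part in literal_parts)
    match = re.fullmatch(regex, string)
    if match is None:
        return None  # no ValueError: the caller decides what to do
    return dict(zip(keys, match.groups()))


parsed = string_to_dict_sketch("train-00001.parquet", "{split}-{shard}.parquet")
print(parsed)  # {'split': 'train', 'shard': '00001'}

missing = string_to_dict_sketch("README.md", "{split}-{shard}.parquet")
if missing is None:
    print("no match")
```

Any caller in this repository that wrapped string_to_dict in try/except ValueError would need an explicit None check after the upgrade.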
v3.3.2
Compare Source
Bug fixes
Other general improvements
New Contributors
Full Changelog: huggingface/datasets@3.3.1...3.3.2
v3.3.1
Compare Source
Bug fixes
Full Changelog: huggingface/datasets@3.3.0...3.3.1
v3.3.0
Compare Source
Dataset Features
Support async functions in map() by @lhoestq in https://github.com/huggingface/datasets/pull/7384
Add repeat method to datasets by @alex-hh in https://github.com/huggingface/datasets/pull/7198
Support faster processing using pandas or polars functions in IterableDataset.map() by @lhoestq in https://github.com/huggingface/datasets/pull/7370
Apply formatting after iter_arrow to speed up format -> map, filter for iterable datasets by @alex-hh in https://github.com/huggingface/datasets/pull/7207
What's Changed
New Contributors
Full Changelog: huggingface/datasets@3.2.0...3.3.0
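The v3.3.0 notes above list "Support async functions in map()". To show what mapping an async function over examples means in general, here is a standard-library sketch. It does not use datasets and makes no claim about how the library schedules coroutines internally; async_map and add_length are hypothetical names.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, List

Example = Dict[str, Any]


async def async_map(
    fn: Callable[[Example], Awaitable[Example]], examples: List[Example]
) -> List[Example]:
    """Run `fn` on every example concurrently; results keep input order."""
    return list(await asyncio.gather(*(fn(example) for example in examples)))


async def add_length(example: Example) -> Example:
    # asyncio.sleep(0) stands in for real I/O such as an HTTP request.
    await asyncio.sleep(0)
    return {**example, "length": len(example["text"])}


examples = [{"text": "hello"}, {"text": "hi"}]
results = asyncio.run(async_map(add_length, examples))
print(results)  # [{'text': 'hello', 'length': 5}, {'text': 'hi', 'length': 2}]
```

With the upgraded dependency, the release note indicates that an async function like add_length could be passed to map() directly; see pull/7384 for the supported behaviour.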
v3.2.0
Compare Source
Dataset Features
Other improvements and bug fixes
ClassLabel by @sergiopaniego in https://github.com/huggingface/datasets/pull/7293
New Contributors
Full Changelog: huggingface/datasets@3.1.0...3.2.0
v3.1.0
Compare Source
Dataset Features
What's Changed
New Contributors
Full Changelog: huggingface/datasets@3.0.2...3.1.0
v3.0.2
Compare Source
Main bug fixes
What's Changed
New Contributors
Full Changelog: huggingface/datasets@3.0.1...3.0.2
v3.0.1
Compare Source
What's Changed
New Contributors
Full Changelog: huggingface/datasets@3.0.0...3.0.1
v3.0.0
Compare Source
Dataset Features
.map()
Allow Polars as valid output type by @psmyth94 in https://github.com/huggingface/datasets/pull/6762
Example:
Cache Changes
huggingface_hub cache by @lhoestq in https://github.com/huggingface/datasets/pull/7105
huggingface_hub cache for files downloaded from HF, by default at ~/.cache/huggingface/hub
datasets cache, by default at ~/.cache/huggingface/datasets
Breaking changes
use_auth_token, fs or ignore_verifications
load_metric, please use the evaluate library instead
task argument in load_dataset(), .prepare_for_task() method, datasets.tasks module
General improvements and bug fixes
cache_dir from cache_file_name by @ringohoffman in https://github.com/huggingface/datasets/pull/7096
New Contributors
Full Changelog: huggingface/datasets@2.21.0...3.0.0
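The v3.0.0 "Cache Changes" above mention two default locations: ~/.cache/huggingface/hub for files downloaded from HF and ~/.cache/huggingface/datasets for the datasets cache. Below is a small standard-library sketch for checking both on a machine after the upgrade. It assumes the defaults quoted in the notes and does not account for any environment overrides; describe_cache is a hypothetical helper.

```python
from pathlib import Path

# Default locations quoted in the v3.0.0 release notes (assumed not overridden).
HUB_CACHE = Path("~/.cache/huggingface/hub").expanduser()
DATASETS_CACHE = Path("~/.cache/huggingface/datasets").expanduser()


def describe_cache(path: Path) -> str:
    """Return the path with a note on whether the directory exists yet."""
    status = "exists" if path.is_dir() else "not created yet"
    return f"{path} ({status})"


for cache in (HUB_CACHE, DATASETS_CACHE):
    print(describe_cache(cache))
```

This can help when reviewing disk usage or CI cache keys that were written for the ==2.21.0 layout.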
Configuration
📅 Schedule: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined).
🚦 Automerge: Disabled by config. Please merge this manually once you are satisfied.
♻ Rebasing: Whenever PR is behind base branch, or you tick the rebase/retry checkbox.
🔕 Ignore: Close this PR and you won't be reminded about this update again.
This PR has been generated by Renovate Bot.