Preserve formatting in concatenated IterableDataset #7522

francescorubbo · 2025-04-16T02:37:33Z

…have consistent formatting

HuggingFaceDocBuilderDev · 2025-04-28T13:41:50Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

lhoestq

Cool ! I took the liberty to update the branch with the changes from main, and added some comments:

lhoestq · 2025-04-28T13:46:42Z

src/datasets/iterable_dataset.py

@@ -3408,6 +3408,10 @@ def _concatenate_iterable_datasets(
    else:
        _check_column_names([col_name for dset in dsets for col_name in dset.features])

+    # Check format is consistent; if so, will set format for concatenated dataset
+    format_type_set = {dset._formatting.format_type for dset in dsets}


dset._formatting doesn't always exist

So if dset._formatting is None for all the datasets, the formatting of the output dataset should be None as well

I've added a check in the set comprehension, so that it skips elements where dset._formatting is None. This means that we will try to set the output format to that of the inputs that do have dset._formatting. Is that reasonable? Or should we set the output format to None in this case?

any discrepancy should result in formatting=None, if we want to be consistent with the Dataset API

Sounds good! I've reworked the logic to account for that.
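To make the rule agreed on in this thread concrete, here is a minimal sketch of "any discrepancy results in formatting=None". It is not the code merged in this PR: `FakeFormatting`, `FakeIterableDataset` and `resolve_format_type` are hypothetical stand-ins that only model the `_formatting` and `format_type` attributes mentioned above.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class FakeFormatting:
    """Hypothetical stand-in for the object stored in `dset._formatting`."""

    format_type: Optional[str]


@dataclass
class FakeIterableDataset:
    """Hypothetical stand-in exposing only the attribute discussed above."""

    _formatting: Optional[FakeFormatting] = None


def resolve_format_type(dsets: Sequence[FakeIterableDataset]) -> Optional[str]:
    """Return the format type shared by all inputs, or None on any discrepancy."""
    # An input without formatting counts as a discrepancy, so it is not skipped.
    if any(dset._formatting is None for dset in dsets):
        return None
    format_types = {dset._formatting.format_type for dset in dsets}
    # Exactly one distinct value means every input agrees on the format.
    return format_types.pop() if len(format_types) == 1 else None
```

With this rule, two inputs formatted as "torch" keep "torch", while mixing a formatted input with an unformatted one resets the output format to None instead of silently adopting the format of the formatted input.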

lhoestq · 2025-04-28T13:47:26Z

src/datasets/iterable_dataset.py

@@ -3408,6 +3408,10 @@ def _concatenate_iterable_datasets(
    else:
        _check_column_names([col_name for dset in dsets for col_name in dset.features])

+    # Check format is consistent; if so, will set format for concatenated dataset
+    format_type_set = {dset._formatting.format_type for dset in dsets}
+    format_type = format_type_set.pop() if len(format_type_set) == 1 else None


if the datasets have disparate formats, maybe you can add this logging info (same as for concatenating Dataset objects):

logger.info("Some of the datasets have disparate format. Resetting the format of the concatenated dataset.")
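As a rough illustration of where that log line could sit, the sketch below resolves a list of per-dataset format types and logs only when the inputs actually disagree. The helper `pick_format_type`, the logger name and the constant are assumptions made for this example and are not part of the `datasets` code base; only the message text comes from the comment above.

```python
import logging
from typing import Optional, Sequence

# Hypothetical logger name, used only for this sketch.
logger = logging.getLogger("concat_format_sketch")

DISPARATE_FORMAT_MSG = (
    "Some of the datasets have disparate format. "
    "Resetting the format of the concatenated dataset."
)


def pick_format_type(format_types: Sequence[Optional[str]]) -> Optional[str]:
    """Return the common format type, logging when the inputs disagree.

    `format_types` holds one entry per input dataset, with None meaning
    that the dataset has no format set.
    """
    distinct = set(format_types)
    if len(distinct) == 1:
        # All inputs agree (possibly on None): nothing to report.
        return distinct.pop()
    if distinct:
        # More than one distinct value: reset the format and say so.
        logger.info(DISPARATE_FORMAT_MSG)
    return None
```

In this sketch, inputs that are all unformatted return None without logging, since there is nothing to reset in that case.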

lhoestq

lgtm !

edit: applied a small change (github messed up and applied it three times smh so it took several commits)

src/datasets/iterable_dataset.py

Preserve formatting in concatenated iterable dataset when the inputs …

4a575b5

…have consistent formatting

francescorubbo mentioned this pull request Apr 16, 2025

concatenate_datasets does not preserve Pytorch format for IterableDataset #7515

Closed

Merge branch 'main' into concat_iterable_with_format

e07365d

style

8550975

lhoestq reviewed Apr 28, 2025


francescorubbo added 4 commits April 28, 2025 07:21

If dset._formatting is None for any of the datasets, set the concat…

721833f

…enated dataset format to None. Add log line for inputs with inconsistent format.

fix incorrect grouping

1e55f37

Reset output formatting if any of the inputs has formatting not set

70d8673

log unset format also in case formatting is set, but format_type …

a724b12

…is None

lhoestq approved these changes May 7, 2025


lhoestq and others added 5 commits May 7, 2025 16:04

Update src/datasets/iterable_dataset.py

8899cec

Update src/datasets/iterable_dataset.py

96e5be5

Update src/datasets/iterable_dataset.py

76f2dcf

Update iterable_dataset.py

2fe7177

Merge branch 'main' into concat_iterable_with_format

464bc60

lhoestq merged commit 53f958e into huggingface:main May 19, 2025
1 of 2 checks passed
