-
Notifications
You must be signed in to change notification settings - Fork 2.8k
Preserve formatting in concatenated IterableDataset #7522
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Open
francescorubbo
wants to merge
7
commits into
huggingface:main
Choose a base branch
from
francescorubbo:concat_iterable_with_format
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
+31
−1
Open
Changes from 3 commits
Commits
Show all changes
7 commits
Select commit
Hold shift + click to select a range
4a575b5
Preserve formatting in concatenated iterable dataset when the inputs …
francescorubbo e07365d
Merge branch 'main' into concat_iterable_with_format
lhoestq 8550975
style
lhoestq 721833f
If `dset._formatting` is None for any of the datasets, set the concat…
francescorubbo 1e55f37
fix incorrect grouping
francescorubbo 70d8673
Reset output formatting if any of the inputs has formatting not set
francescorubbo a724b12
log unset format also in case `formatting` is set, but `format_type` …
francescorubbo File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -3408,6 +3408,10 @@ def _concatenate_iterable_datasets( | |
else: | ||
_check_column_names([col_name for dset in dsets for col_name in dset.features]) | ||
|
||
# Check format is consistent; if so, will set format for concatenated dataset | ||
format_type_set = {dset._formatting.format_type for dset in dsets} | ||
format_type = format_type_set.pop() if len(format_type_set) == 1 else None | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. if the datasets have disparate formats, maybe you can add this logging info (same as for concatenating logger.info("Some of the datasets have disparate format. Resetting the format of the concatenated dataset.") |
||
|
||
# TODO: improve this to account for a mix of ClassLabel and Value for example | ||
# right now it would keep the type of the first dataset in the list | ||
features = Features( | ||
|
@@ -3429,7 +3433,13 @@ def _concatenate_iterable_datasets( | |
# Get all the auth tokens per repository - in case the datasets come from different private repositories | ||
token_per_repo_id = {repo_id: token for dataset in dsets for repo_id, token in dataset._token_per_repo_id.items()} | ||
# Return new daset | ||
return IterableDataset(ex_iterable=ex_iterable, info=info, split=split, token_per_repo_id=token_per_repo_id) | ||
return IterableDataset( | ||
ex_iterable=ex_iterable, | ||
info=info, | ||
split=split, | ||
token_per_repo_id=token_per_repo_id, | ||
formatting=FormattingConfig(format_type=format_type), | ||
) | ||
|
||
|
||
def _interleave_iterable_datasets( | ||
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dset._formatting
doesn't always existSo if
dset._formatting
is None for all the datasets, the formatting of the output dataset should be None as wellThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added a check in the set comprehension, so that it skips elements where
dset._formatting
isNone
. This means that we will try to set the output format to that of the inputs who do havedset._formatting
. Is that reasonable? or we should set the output format toNone
in this case?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
any discrepancy should result in formatting=None, if we want to be consistent with the
Dataset
APIThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good! I've reworked the logic to account for that.