-
Notifications
You must be signed in to change notification settings - Fork 37
Synchronize end of loader for repeat=False #146
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Open
voegtlel
wants to merge
6
commits into
develop
Choose a base branch
from
feature/end_of_loader_sync
base: develop
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Open
Changes from all commits
Commits
Show all changes
6 commits
Select commit
Hold shift + click to select a range
c12a188
Small test fixes
voegtlel 476c739
Implement synchronization at the end of dataset loader with repeat=False
voegtlel a0402fc
Add end of dataset sync to documentation.
voegtlel 7ec2298
Just fix
voegtlel 64b7f6b
Fix sync test import
voegtlel 6a8d94d
Small doc adaption
voegtlel File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,21 @@ | ||
| # Copyright (c) 2025, NVIDIA CORPORATION. | ||
| # SPDX-License-Identifier: BSD-3-Clause | ||
| from megatron.energon.sync_end.redistribute import RedistributeDataLoaderState, RedistributeLoader | ||
| from megatron.energon.sync_end.stop_first_end import StopFirstDataLoaderState, StopFirstLoader | ||
|
|
||
| """ | ||
| Provides wrappers for the dataset loaders that allow for synchronization at the end of the dataset. | ||
| I.e. if running a training with repeat=False, the loaders will typically exhaust at different times, which may require | ||
| synchronization across ranks. | ||
|
|
||
| The wrappers are: | ||
| - RedistributeLoader: Redistributes the last samples to the ranks that are not exhausted. | ||
| - StopFirstLoader: Stops iterating as soon as the first rank is exhausted. | ||
| """ | ||
|
|
||
| __all__ = [ | ||
| "RedistributeLoader", | ||
| "RedistributeDataLoaderState", | ||
| "StopFirstLoader", | ||
| "StopFirstDataLoaderState", | ||
| ] |
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be better if this was transparent to the user, so make
get_loaderhandle this internally.And since the choice of
RedistributeLoadervs.StopFirstLoaderactually changes the data that's being iterated, this choice should be made in the metadataset and not in the code, I think.As a property of blend_epochized.
I.e. blend_epochized can either be a list as before (chooses default
RedistributeLoader), or it can be a dict for more customization likeThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, we can make this be handled in
get_loader. I personally would prefer it to be separate though, as it's a feature on top?Regarding moving the configuration to the metadataset:
I see your point that this slightly modifies how the data is iterated, but I'd also argue:
blend_epochized, because you cannot nest different (or unconfigured)phase_out_behavior.repeat=Falseand doesn't make sense ifrepeat=True, so it's based on what the user sets in the code.RedistributeLoaderit should not really change the data frequency (so far the settings in the metadataset mainly focuses on data frequency / blend).Thus voting for keeping this in code, not in the metadataset config.