Pccl integration #241
Open: mikex86 wants to merge 47 commits into prime-v2 from pccl-integration
Commits
5a45af8 small cleanup (mikex86)
e96c5e7 introduce mpi info and allow non-mpi runs (mikex86)
f5263f8 working fsdp with pccl accept loop (mikex86)
1cd3764 working sync DiLoCo (mikex86)
43be48c working async DiLoCo (mikex86)
6e60343 introduce functions for sanity (mikex86)
471f640 configurable async/non-async DiLoCo (mikex86)
2a842ad implemented nibble dataset (mikex86)
12351cb fix bug where outer lr is not set (mikex86)
efe58d9 fix configs & unit tests (mikex86)
c74d5b2 fix ruff (mikex86)
eaeec2a fix ruff (mikex86)
1160e49 clone pccl dependency via git instead of https (mikex86)
8bd560d fix pccl git url (mikex86)
f9e37ea backported ParquetDataset (mikex86)
906517e fix ruff (mikex86)
22c7249 fix pending mpi ranks join wait logic (mikex86)
09b515b add launch prime script & add nibble ds folder listing support (mikex86)
a072ba2 add streaming data loader support (mikex86)
8867b06 fix fake data loader (mikex86)
bf39f61 fix for 8xH100 (mikex86)
7f6d0af add H100 config (mikex86)
1762ce2 add topology optimization (mikex86)
25b27ce change config of launch_prime.sh (mikex86)
bebdd0f small changes (mikex86)
cb20ff1 fix training_progress.step (mikex86)
b9615ac log outer step (mikex86)
02647c0 fix incompetence (mikex86)
f89e4a9 set step from shared state synced var exactly post shared state sync (mikex86)
4573f0d fix NCCL crash on some Lambda nodes (mikex86)
12f89c4 enable topology optimization (mikex86)
30ff670 utilize SharedStateSyncStrategy (mikex86)
933ba5b fix typo (mikex86)
fee98cd bump pccl commit revision (mikex86)
2ba792c bump pccl commit revision (mikex86)
387f0bd bump pccl commit revision (mikex86)
00b9724 bump pccl commit revision (mikex86)
3f8f849 fix config dataset (samsja)
4060e5f add logging pccl (samsja)
54245d5 add diloco delayed default vlaue (samsja)
a08b87e fix fake data (samsja)
01e63a9 bump pccl commit revision (mikex86)
3d47532 Add dtype argument (mikex86)
4b170e1 Add missing dtype argument (mikex86)
4be8810 bump pccl commit revision (mikex86)
1642298 revert (mikex86)
60d83bd fix unused imports (mikex86)
@@ -25,3 +25,7 @@ weight_decay = 0.1
[data]
seq_length = 8192
num_workers = 4
fake = true

[pccl]
ccoip_host = "127.0.0.1:48148"
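
The `[pccl]` table added in this hunk carries a single `ccoip_host` value in `host:port` form. As a rough illustration of how a client might split that value before connecting, here is a minimal Python sketch. The helper name `parse_ccoip_host` is hypothetical and is not taken from the PR or from PCCL; how the project actually consumes the value is not shown in this diff.

```python
from typing import Tuple


def parse_ccoip_host(value: str) -> Tuple[str, int]:
    """Split a "host:port" string into a (host, port) pair.

    Hypothetical helper for illustration only; not part of the PR.
    """
    # rpartition splits on the LAST colon, so the port is always the tail.
    host, sep, port_text = value.rpartition(":")
    if not sep or not host:
        raise ValueError(f"expected 'host:port', got {value!r}")
    port = int(port_text)  # raises ValueError on a non-numeric port
    if not 0 < port < 65536:
        raise ValueError(f"port out of range: {port}")
    return host, port


if __name__ == "__main__":
    # Value copied from the [pccl] section in the diff above.
    print(parse_ccoip_host("127.0.0.1:48148"))
```

Validating the port up front means a malformed config fails at startup rather than at the first connection attempt.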
@@ -1,15 +1,42 @@
project = "debug_7B_zero_band"

model_name = "7B"
model_type = "llama2"
model_type = "llama3"

wandb = true
log_all_ranks = true

[hardware]
micro_batch_size = 64
reshard_after_forward = false
micro_batch_size = 8
reshard_after_forward = true
torch_compile = false
attn_fn="sdpa"

[train]
batch_size = 512

[train.lr_scheduler]
lr = 3e-4
end_lr = 0.0
num_warmup_steps = 8000
num_decay_steps = 1.2e6

[train.outer_lr_scheduler]
lr = 0.7
end_lr = 0.7
num_decay_steps = 0
num_warmup_steps = 0
num_stable_steps = 0

[train.outer_optimizer]
type = "sgd"

[data]
seq_length = 1024
dataset_name_or_paths = "datasets/fineweb-edu"
dataset_name_or_paths = 'http://65.108.32.176:8080/api/v1/datasets/fineweb-edu-train/stream'
token_bit_size = 17

[diloco]
inner_steps = 64
delayed_update = true

[pccl]
ccoip_host = "127.0.0.1:48148"
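
The config above pairs `[diloco] inner_steps = 64` with an SGD outer optimizer at a constant outer learning rate of 0.7. To make the role of those settings concrete, here is a minimal sketch of a DiLoCo-style outer step as the algorithm is commonly described: each worker runs its inner steps locally, then the difference between the shared parameters and the averaged worker parameters is treated as a pseudo-gradient for the outer optimizer. This is an illustration under stated assumptions (plain SGD without momentum, parameters as flat lists of floats, a hypothetical function name `outer_sgd_step`), not the code in this PR, and it does not model `delayed_update`.

```python
from typing import List


def outer_sgd_step(
    global_params: List[float],
    worker_params: List[List[float]],
    outer_lr: float,
) -> List[float]:
    """One outer step of a DiLoCo-style update with plain SGD.

    Hypothetical sketch: parameters are flat float lists, no momentum.
    """
    num_workers = len(worker_params)
    if num_workers == 0:
        raise ValueError("need at least one worker")
    updated = []
    for i, g in enumerate(global_params):
        # Average of what each worker reached after its inner steps.
        avg_local = sum(w[i] for w in worker_params) / num_workers
        # Pseudo-gradient: how far the workers moved away from the shared value.
        pseudo_grad = g - avg_local
        updated.append(g - outer_lr * pseudo_grad)
    return updated


if __name__ == "__main__":
    shared = [1.0, 2.0]
    workers = [[0.0, 2.0], [1.0, 4.0]]
    # outer_lr = 0.7 matches [train.outer_lr_scheduler] in the config above.
    print(outer_sgd_step(shared, workers, outer_lr=0.7))
```

With `outer_lr = 1.0` this update reduces to plain parameter averaging across workers; values below 1.0 move the shared parameters only part of the way toward that average.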
@@ -1,20 +1,39 @@
model_name = "debugmodel"
model_type = "llama2"
model_name = "150M"
model_type = "llama3"

wandb = false
log_all_ranks = true

[hardware]
micro_batch_size = 8
micro_batch_size = 32
torch_compile = true

[train]
batch_size = 16
batch_size = 512

[train.lr_scheduler]
num_warmup_steps = 10
num_decay_steps = 10
num_decay_steps = 1000

[train.outer_lr_scheduler]
lr = 1.0
end_lr = 1.0
num_decay_steps = 0
num_warmup_steps = 0
num_stable_steps = 0

[train.outer_optimizer]
type = "sgd"

[data]
fake = true
#dataset_name_or_paths = 'tests/test_data/parquet/parquet_ds_folder_1,tests/test_data/parquet/parquet_ds_folder_2'
dataset_name_or_paths = '/home/mike/IntelliJProjects/dataproctest/working_dir/train'
#dataset_ratio = "50:50"
token_bit_size = 17

[diloco]
inner_steps = 5
inner_steps = 16
delayed_update = true

[pccl]
ccoip_host = "127.0.0.1:48148"
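
Both updated configs switch `model_type` to `"llama3"` and set `token_bit_size = 17`. The diff does not define `token_bit_size`; assuming it is the number of bits used to store each token id in the dataset, the value is consistent with the Llama 3 tokenizer's vocabulary of 128,256 entries, which does not fit in 16 bits. The sketch below computes the smallest bit width that can hold every id of a given vocabulary. The function name `min_token_bits` is hypothetical and not part of the PR.

```python
def min_token_bits(vocab_size: int) -> int:
    """Smallest number of bits able to represent ids 0 .. vocab_size - 1.

    Hypothetical helper for illustration only.
    """
    if vocab_size < 1:
        raise ValueError("vocab_size must be positive")
    # The largest id is vocab_size - 1; bit_length gives the bits it needs.
    return max(1, (vocab_size - 1).bit_length())


if __name__ == "__main__":
    # 128256 is the Llama 3 vocabulary size; 32000 is the Llama 2 size.
    print(min_token_bits(128256))  # 17
    print(min_token_bits(32000))   # 15
```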
@@ -24,3 +24,6 @@ num_warmup_steps = 1000
lr = 3e-4
end_lr = 0.0
num_decay_steps = 80000

[pccl]
ccoip_host = "127.0.0.1:48148"