Changes for removing unused terms in CE loss fn #1643
Merged
Conversation
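For context on the change itself: below is a minimal sketch of what dropping unused (ignored-label) terms from a cross-entropy loss can look like. It is only an illustration of the general idea, not the actual llm-foundry implementation; the function name and signature are hypothetical.

```python
import torch
import torch.nn.functional as F

def token_mean_cross_entropy(
    logits: torch.Tensor,   # (batch, seq_len, vocab)
    labels: torch.Tensor,   # (batch, seq_len); ignore_index marks unused positions
    ignore_index: int = -100,
) -> torch.Tensor:
    """Cross entropy averaged only over positions that contribute a term.

    Positions labeled with ``ignore_index`` (padding, prompt tokens, etc.) are
    skipped entirely instead of being zeroed out and still counted.
    """
    # Sum the per-token losses; ignored positions add no terms.
    loss_sum = F.cross_entropy(
        logits.view(-1, logits.size(-1)),
        labels.view(-1),
        ignore_index=ignore_index,
        reduction='sum',
    )
    # Normalize by the number of tokens that actually contributed a term.
    num_valid = (labels != ignore_index).sum().clamp(min=1)
    return loss_sum / num_valid
```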
dakinggg reviewed Nov 6, 2024
dakinggg reviewed Nov 6, 2024
dakinggg reviewed Nov 7, 2024
Contributor (Author)
Run logs:
WARNING:runtime_private_plugins.utils.config_utils:Sequence parallelism is only supported for accumulating the batch on tokens. Setting accumulate_train_batch_on_tokens to True.
DEBUG: llmfoundry.command_utils.train: Initializing dist with device...
DEBUG: llmfoundry.command_utils.train: Testing barrier with device...
DEBUG: llmfoundry.command_utils.train: Barrier test passed with device.
INFO: llmfoundry.command_utils.train: Building tokenizer...
INFO: llmfoundry.command_utils.train: Building train loader...
INFO: streaming.base.dataset: Because `predownload` was not specified, it will default to 8*batch_size if batch_size is not None, otherwise 64.
INFO: llmfoundry.command_utils.train: Building eval loader...
INFO: llmfoundry.command_utils.train: Initializing model...
DEBUG: llmfoundry.models.mpt.modeling_mpt: Using kaiming_normal_ initialization.
INFO: llmfoundry.command_utils.train: Building trainer...
INFO: composer.utils.reproducibility: Setting seed to 24
INFO: composer.trainer.trainer: Run name: interactive-j1KMfR
INFO: composer.core.state: Automatically setting data_parallel_shard to have parallelization degree 8.
/usr/lib/python3/dist-packages/composer/trainer/trainer.py:1630: UserWarning: Specifying `eval_interval=500ba` without an `eval_dataloader` has no effect. If trying to run an evaluator, make sure `eval_dataloader` is specified. Otherwise, set `eval_interval` to 0 or default value 1.
warnings.warn(
INFO: composer.trainer.trainer: Stepping schedulers every batch. To step schedulers every epoch, set `step_schedulers_every_batch=False`.
/usr/lib/python3/dist-packages/torch/storage.py:414: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
return torch.load(io.BytesIO(b))
INFO: composer.utils.reproducibility: Setting seed to 17
DEBUG: composer.utils.reproducibility: Restoring the RNG state
DEBUG: composer.loggers.mosaicml_logger: Logging model initialized time to metadata
INFO: composer.trainer.trainer: Setting seed to 24
INFO: composer.utils.reproducibility: Setting seed to 24
INFO: llmfoundry.command_utils.train: Logging config
INFO: llmfoundry.command_utils.train: Starting training...
INFO: composer.trainer.trainer: Using precision Precision.AMP_BF16
DEBUG: composer.trainer.trainer: Spinning the dataloaders
DEBUG: composer.trainer.trainer: Starting training loop
INFO: streaming.base.dataset: Because `num_canonical_nodes` was not specified, and `shuffle_algo` is py1e, it will default to be equal to physical nodes.
INFO: streaming.base.dataset: Because `shuffle_block_size` was not specified, it will default to max(4_000_000 // num_canonical_nodes, 1 << 18) if num_canonical_nodes is not None, otherwise 262144.
Logs between batches:
[batch=15/4800]:
Train time/batch: 14
Train time/sample: 3584
Train time/batch_in_epoch: 14
Train time/sample_in_epoch: 3584
Train time/token: 7340032
Train time/token_in_epoch: 7340032
Train memory/current_allocated_mem: 0.4178
Train memory/current_active_mem: 0.4178
Train memory/current_inactive_mem: 0.1841
Train memory/current_reserved_mem: 1.6840
Train memory/peak_allocated_mem: 1.0126
Train memory/peak_active_mem: 1.0399
Train memory/peak_inactive_mem: 0.6321
Train memory/peak_reserved_mem: 1.6840
Train memory/alloc_retries: 0
Train trainer/device_train_microbatch_size: 0.5000
Train loss/train/total: 9.2921
Train metrics/train/LanguageCrossEntropy: 9.2928
Train metrics/train/LanguagePerplexity: 10859.7646
Train metrics/train/TokenAccuracy: 0.0460
Train throughput/batches_per_sec: 0.1895
Train throughput/samples_per_sec: 48.5081
Train throughput/device/batches_per_sec: 0.0237
Train throughput/device/samples_per_sec: 6.0635
Train throughput/tokens_per_sec: 99344.5516
Train throughput/device/tokens_per_sec: 12418.0689
Train throughput/flops_per_sec: 96257336207502.8594
Train throughput/device/flops_per_sec: 12032167025937.8574
Train throughput/device/mfu: 0.0122
Train time/train: 0.0231
Train time/val: 0.0000
Train time/total: 0.0231
Train lr-DecoupledAdamW/group0: 0.0001
Train time/remaining_estimate: 6.9695
[batch=16/4800]:
Train time/batch: 15
Train time/sample: 3840
Train time/batch_in_epoch: 15
Train time/sample_in_epoch: 3840
Train time/token: 7864320
Train time/token_in_epoch: 7864320
Train memory/current_allocated_mem: 0.4178
Train memory/current_active_mem: 0.4178
Train memory/current_inactive_mem: 0.1841
Train memory/current_reserved_mem: 1.6840
Train memory/peak_allocated_mem: 1.0126
Train memory/peak_active_mem: 1.0399
Train memory/peak_inactive_mem: 0.6321
Train memory/peak_reserved_mem: 1.6840
Train memory/alloc_retries: 0
Train trainer/device_train_microbatch_size: 0.5000
Train loss/train/total: 9.1863
Train metrics/train/LanguageCrossEntropy: 9.1856
Train metrics/train/LanguagePerplexity: 9755.9043
Train metrics/train/TokenAccuracy: 0.0482
Train throughput/batches_per_sec: 0.1901
Train throughput/samples_per_sec: 48.6691
Train throughput/device/batches_per_sec: 0.0238
Train throughput/device/samples_per_sec: 6.0836
Train throughput/tokens_per_sec: 99674.2310
Train throughput/device/tokens_per_sec: 12459.2789
Train throughput/flops_per_sec: 96576770543371.9688
Train throughput/device/flops_per_sec: 12072096317921.4961
Train throughput/device/mfu: 0.0122
Train time/train: 0.0246
Train time/val: 0.0000
Train time/total: 0.0246
Train lr-DecoupledAdamW/group0: 0.0001
Train time/remaining_estimate: 6.9653
[batch=17/4800]:
Train time/batch: 16
Train time/sample: 4096
Train time/batch_in_epoch: 16
Train time/sample_in_epoch: 4096
Train time/token: 8388608
Train time/token_in_epoch: 8388608
Train memory/current_allocated_mem: 0.4178
Train memory/current_active_mem: 0.4178
Train memory/current_inactive_mem: 0.1841
Train memory/current_reserved_mem: 1.6840
I do not see any logs for deprecation. cc: @dakinggg (Are we good on this?)
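As a quick consistency check on the metrics above (not part of the PR itself): the logged LanguagePerplexity should simply be exp(LanguageCrossEntropy), and that holds for both batches shown, which suggests the loss and metric paths still agree after the change.

```python
import math

# Values copied from the run logs above; perplexity = exp(cross entropy).
print(math.exp(9.2928))  # ≈ 10859.5, matches LanguagePerplexity 10859.7646 at batch 15 (up to rounding of the logged CE)
print(math.exp(9.1856))  # ≈ 9754.7, matches LanguagePerplexity 9755.9043 at batch 16 (up to rounding of the logged CE)
```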
dakinggg reviewed Nov 19, 2024
dakinggg approved these changes Nov 19, 2024
We need to deprecate this in favor of the new changes in this PR, which account for the correct loss calculation based on tokens.
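For reference, a hedged sketch of what a token-based loss calculation means when a batch is split into microbatches: accumulate the summed per-token CE and the count of valid tokens separately, then divide once at the end, rather than averaging per-microbatch means. This is an illustration under those assumptions only; the helper below is hypothetical and not an llm-foundry or Composer API.

```python
import torch.nn.functional as F

def batch_loss_on_tokens(model, microbatches, ignore_index: int = -100):
    """Cross entropy accumulated over microbatches, weighted by token count.

    Averaging per-microbatch means over-weights microbatches that happen to
    contain few valid (non-ignored) tokens; dividing the summed loss by the
    total valid-token count matches computing CE over the whole batch at once.
    """
    total_loss = 0.0
    total_tokens = 0
    for inputs, labels in microbatches:
        logits = model(inputs)
        # Sum of per-token losses for this microbatch (ignored positions add nothing).
        total_loss = total_loss + F.cross_entropy(
            logits.view(-1, logits.size(-1)),
            labels.view(-1),
            ignore_index=ignore_index,
            reduction='sum',
        )
        # Count only the tokens that contributed a loss term.
        total_tokens = total_tokens + int((labels != ignore_index).sum())
    return total_loss / max(total_tokens, 1)
```

This is also the behavior the `accumulate_train_batch_on_tokens` warning at the top of the run logs refers to: with sequence parallelism, the batch is accumulated on tokens rather than on samples.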