
Conversation

@gupta-abhay
Contributor

We need to deprecate this in favor of the new changes in this PR, which account for the correct loss calculation based on tokens.
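For context, a minimal sketch of what "loss calculation based on tokens" means (an illustration only, not the diff in this PR; the helper name and the `-100` ignore index are assumptions): averaging per-microbatch CE means and then averaging those means weights short microbatches too heavily, whereas summing per-token losses and dividing by the total non-padded token count gives the same loss regardless of how the batch is split into microbatches.

```python
# Sketch only: token-weighted cross entropy across microbatches.
# Assumptions: labels use -100 as the ignore/padding index; `token_weighted_ce`
# is a hypothetical helper, not a function from llm-foundry or this PR.
import torch
import torch.nn.functional as F

IGNORE_INDEX = -100

def token_weighted_ce(logits_list, labels_list):
    """Accumulate CE on tokens: sum per-token losses, divide by token count."""
    total_loss = torch.zeros(())
    total_tokens = 0
    for logits, labels in zip(logits_list, labels_list):
        total_loss = total_loss + F.cross_entropy(
            logits.view(-1, logits.size(-1)),
            labels.view(-1),
            ignore_index=IGNORE_INDEX,
            reduction='sum',  # sum, not mean: every token counts equally
        )
        total_tokens += int((labels != IGNORE_INDEX).sum())
    return total_loss / max(total_tokens, 1)
```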

@gupta-abhay gupta-abhay changed the title WIP: changes for removing unused terms Changes for removing unused terms in CE loss fn Nov 6, 2024
@gupta-abhay gupta-abhay marked this pull request as ready for review November 6, 2024 18:00
@gupta-abhay gupta-abhay requested a review from a team as a code owner November 6, 2024 18:00
@gupta-abhay
Contributor Author

gupta-abhay commented Nov 19, 2024

Run logs:

WARNING:runtime_private_plugins.utils.config_utils:Sequence parallelism is only supported for accumulating the batch on tokens. Setting accumulate_train_batch_on_tokens to True.             
DEBUG: llmfoundry.command_utils.train: Initializing dist with device...                                                                    
DEBUG: llmfoundry.command_utils.train: Testing barrier with device...                                                                      
DEBUG: llmfoundry.command_utils.train: Barrier test passed with device.                                                                    
INFO: llmfoundry.command_utils.train: Building tokenizer...                                                                                
INFO: llmfoundry.command_utils.train: Building train loader...                                                                             
INFO: streaming.base.dataset: Because `predownload` was not specified, it will default to 8*batch_size if batch_size is not None, otherwise 64.
INFO: llmfoundry.command_utils.train: Building eval loader...                                                                              
INFO: llmfoundry.command_utils.train: Initializing model...
DEBUG: llmfoundry.models.mpt.modeling_mpt: Using kaiming_normal_ initialization.                          
INFO: llmfoundry.command_utils.train: Building trainer...    
INFO: composer.utils.reproducibility: Setting seed to 24                                                                                   
INFO: composer.trainer.trainer: Run name: interactive-j1KMfR                                                                               
INFO: composer.core.state: Automatically setting data_parallel_shard to have parallelization degree 8.
/usr/lib/python3/dist-packages/composer/trainer/trainer.py:1630: UserWarning: Specifying `eval_interval=500ba` without an `eval_dataloader` has no effect. If trying to run an evaluator, make sure `eval_dataloader` is specified. Otherwise, set `eval_interval` to 0 or default value 1.
  warnings.warn(                                                                               
INFO: composer.trainer.trainer: Stepping schedulers every batch. To step schedulers every epoch, set `step_schedulers_every_batch=False`.  
/usr/lib/python3/dist-packages/torch/storage.py:414: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  return torch.load(io.BytesIO(b))                                                                                                                                                            
INFO: composer.utils.reproducibility: Setting seed to 17                                                                                   
DEBUG: composer.utils.reproducibility: Restoring the RNG state                                                                             
DEBUG: composer.loggers.mosaicml_logger: Logging model initialized time to metadata                                                        
INFO: composer.trainer.trainer: Setting seed to 24                                                                                         
INFO: composer.utils.reproducibility: Setting seed to 24                                                                                   
INFO: llmfoundry.command_utils.train: Logging config                                                                                       
INFO: llmfoundry.command_utils.train: Starting training...                                                                                 
INFO: composer.trainer.trainer: Using precision Precision.AMP_BF16                                                                         
DEBUG: composer.trainer.trainer: Spinning the dataloaders                                                                                  
DEBUG: composer.trainer.trainer: Starting training loop                                                                                    
INFO: streaming.base.dataset: Because `num_canonical_nodes` was not specified, and `shuffle_algo` is py1e, it will default to be equal to physical nodes.
INFO: streaming.base.dataset: Because `shuffle_block_size` was not specified, it will default to max(4_000_000 // num_canonical_nodes, 1 << 18) if num_canonical_nodes is not None, otherwise 262144.

Logs between batches:

[batch=15/4800]:                                                                               
         Train time/batch: 14                                                                  
         Train time/sample: 3584                                                               
         Train time/batch_in_epoch: 14                                                         
         Train time/sample_in_epoch: 3584                                                      
         Train time/token: 7340032                                                             
         Train time/token_in_epoch: 7340032                                                    
         Train memory/current_allocated_mem: 0.4178                
         Train memory/current_active_mem: 0.4178                                               
         Train memory/current_inactive_mem: 0.1841                                             
         Train memory/current_reserved_mem: 1.6840
         Train memory/peak_allocated_mem: 1.0126                                                                                                                                              
         Train memory/peak_active_mem: 1.0399                                                  
         Train memory/peak_inactive_mem: 0.6321                                                                                                                                               
         Train memory/peak_reserved_mem: 1.6840                                                
         Train memory/alloc_retries: 0                                                         
         Train trainer/device_train_microbatch_size: 0.5000                                    
         Train loss/train/total: 9.2921                                                        
         Train metrics/train/LanguageCrossEntropy: 9.2928                                      
         Train metrics/train/LanguagePerplexity: 10859.7646                                    
         Train metrics/train/TokenAccuracy: 0.0460                                             
         Train throughput/batches_per_sec: 0.1895                            
         Train throughput/samples_per_sec: 48.5081
         Train throughput/device/batches_per_sec: 0.0237                                       
         Train throughput/device/samples_per_sec: 6.0635                                       
         Train throughput/tokens_per_sec: 99344.5516                                           
         Train throughput/device/tokens_per_sec: 12418.0689                                    
         Train throughput/flops_per_sec: 96257336207502.8594                                                                                                                                  
         Train throughput/device/flops_per_sec: 12032167025937.8574                            
         Train throughput/device/mfu: 0.0122                                                                                                                                                  
         Train time/train: 0.0231                                                                                                                                                             
         Train time/val: 0.0000                                                                                                                                                               
         Train time/total: 0.0231                                                                                                                                                             
         Train lr-DecoupledAdamW/group0: 0.0001                                                                                                                                               
         Train time/remaining_estimate: 6.9695                                                                                                                                                
[batch=16/4800]:                                                                                                                                                                              
         Train time/batch: 15                                                                                                                                                                 
         Train time/sample: 3840                                                                                                                                                              
         Train time/batch_in_epoch: 15                                                                                                                                                        
         Train time/sample_in_epoch: 3840                                                                                                                                                     
         Train time/token: 7864320                                                                                                                                                            
         Train time/token_in_epoch: 7864320                                                                                                                                                   
         Train memory/current_allocated_mem: 0.4178                                                                                                                                           
         Train memory/current_active_mem: 0.4178                                                                                                                                              
         Train memory/current_inactive_mem: 0.1841                                             
         Train memory/current_reserved_mem: 1.6840
         Train memory/peak_allocated_mem: 1.0126                                               
         Train memory/peak_active_mem: 1.0399                                                  
         Train memory/peak_inactive_mem: 0.6321                                                
         Train memory/peak_reserved_mem: 1.6840                                                
         Train memory/alloc_retries: 0                                                         
         Train trainer/device_train_microbatch_size: 0.5000                                    
         Train loss/train/total: 9.1863              
         Train metrics/train/LanguageCrossEntropy: 9.1856                                                  
         Train metrics/train/LanguagePerplexity: 9755.9043                                                 
         Train metrics/train/TokenAccuracy: 0.0482                                                         
         Train throughput/batches_per_sec: 0.1901                                                                         
         Train throughput/samples_per_sec: 48.6691                                                                        
         Train throughput/device/batches_per_sec: 0.0238                                                                  
         Train throughput/device/samples_per_sec: 6.0836                                                                  
         Train throughput/tokens_per_sec: 99674.2310                                                                      
         Train throughput/device/tokens_per_sec: 12459.2789                                                               
         Train throughput/flops_per_sec: 96576770543371.9688                                                              
         Train throughput/device/flops_per_sec: 12072096317921.4961                                                       
         Train throughput/device/mfu: 0.0122                 
         Train time/train: 0.0246                            
         Train time/val: 0.0000                              
         Train time/total: 0.0246                            
         Train lr-DecoupledAdamW/group0: 0.0001                                                                           
         Train time/remaining_estimate: 6.9653                         
[batch=17/4800]:                                                       
         Train time/batch: 16                                          
         Train time/sample: 4096                                       
         Train time/batch_in_epoch: 16                                 
         Train time/sample_in_epoch: 4096                              
         Train time/token: 8388608                                     
         Train time/token_in_epoch: 8388608                            
         Train memory/current_allocated_mem: 0.4178                                                                                            
         Train memory/current_active_mem: 0.4178                                                                                               
         Train memory/current_inactive_mem: 0.1841                                                                                             
         Train memory/current_reserved_mem: 1.6840

I do not see any logs for deprecation. cc: @dakinggg (Are we good on this?)
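(As an aside, a quick way to double-check this locally, sketch only and not part of the PR: Python's default filters hide `DeprecationWarning` outside `__main__`, so escalating it to an error before launching training makes any remaining deprecated code path fail loudly.)

```python
# Sketch only (not part of this PR): surface any lingering DeprecationWarning.
import warnings

# Escalate DeprecationWarning to an exception; by default Python ignores it
# outside __main__, so a deprecation message could be silently filtered out.
warnings.simplefilter("error", DeprecationWarning)
# ...then run the same training entrypoint as in the logs above.
```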

@gupta-abhay gupta-abhay merged commit ce13961 into main Nov 19, 2024
9 checks passed
@gupta-abhay gupta-abhay deleted the abhay/deprecate_seqparallel_terms branch June 5, 2025 15:05
