
Conversation

@gupta-abhay
Contributor

We need to deprecate this in favor of the new changes in this PR, which account for the correct loss calculation based on tokens.
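For context, a minimal sketch of what "loss calculation based on tokens" means (an illustration only, not the diff in this PR; the helper name and the `-100` ignore index are assumptions): averaging per-microbatch CE means and then averaging those means weights short microbatches too heavily, whereas summing per-token losses and dividing by the total non-padded token count gives the same loss regardless of how the batch is split into microbatches.

```python
# Sketch only: token-weighted cross entropy across microbatches.
# Assumptions: labels use -100 as the ignore/padding index; `token_weighted_ce`
# is a hypothetical helper, not a function from llm-foundry or this PR.
import torch
import torch.nn.functional as F

IGNORE_INDEX = -100

def token_weighted_ce(logits_list, labels_list):
    """Accumulate CE on tokens: sum per-token losses, divide by token count."""
    total_loss = torch.zeros(())
    total_tokens = 0
    for logits, labels in zip(logits_list, labels_list):
        total_loss = total_loss + F.cross_entropy(
            logits.view(-1, logits.size(-1)),
            labels.view(-1),
            ignore_index=IGNORE_INDEX,
            reduction='sum',  # sum, not mean: every token counts equally
        )
        total_tokens += int((labels != IGNORE_INDEX).sum())
    return total_loss / max(total_tokens, 1)
```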

@gupta-abhay gupta-abhay changed the title WIP: changes for removing unused terms Changes for removing unused terms in CE loss fn Nov 6, 2024
@gupta-abhay gupta-abhay marked this pull request as ready for review November 6, 2024 18:00
@gupta-abhay gupta-abhay requested a review from a team as a code owner November 6, 2024 18:00
@gupta-abhay
Contributor Author

gupta-abhay commented Nov 19, 2024

Run logs:

WARNING:runtime_private_plugins.utils.config_utils:Sequence parallelism is only supported for accumulating the batch on tokens. Setting accumulate_train_batch_on_tokens to True.             
DEBUG: llmfoundry.command_utils.train: Initializing dist with device...                                                                    
DEBUG: llmfoundry.command_utils.train: Testing barrier with device...                                                                      
DEBUG: llmfoundry.command_utils.train: Barrier test passed with device.                                                                    
INFO: llmfoundry.command_utils.train: Building tokenizer...                                                                                
INFO: llmfoundry.command_utils.train: Building train loader...                                                                             
INFO: streaming.base.dataset: Because `predownload` was not specified, it will default to 8*batch_size if batch_size is not None, otherwise 64.
INFO: llmfoundry.command_utils.train: Building eval loader...                                                                              
INFO: llmfoundry.command_utils.train: Initializing model...
DEBUG: llmfoundry.models.mpt.modeling_mpt: Using kaiming_normal_ initialization.                          
INFO: llmfoundry.command_utils.train: Building trainer...    
INFO: composer.utils.reproducibility: Setting seed to 24                                                                                   
INFO: composer.trainer.trainer: Run name: interactive-j1KMfR                                                                               
INFO: composer.core.state: Automatically setting data_parallel_shard to have parallelization degree 8.
/usr/lib/python3/dist-packages/composer/trainer/trainer.py:1630: UserWarning: Specifying `eval_interval=500ba` without an `eval_dataloader` has no effect. If trying to run an evaluator, make sure `eval_dataloader` is specified. Otherwise, set `eval_interval` to 0 or default value 1.
  warnings.warn(                                                                               
INFO: composer.trainer.trainer: Stepping schedulers every batch. To step schedulers every epoch, set `step_schedulers_every_batch=False`.  
/usr/lib/python3/dist-packages/torch/storage.py:414: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  return torch.load(io.BytesIO(b))                                                                                                                                                            
INFO: composer.utils.reproducibility: Setting seed to 17                                                                                   
DEBUG: composer.utils.reproducibility: Restoring the RNG state                                                                             
DEBUG: composer.loggers.mosaicml_logger: Logging model initialized time to metadata                                                        
INFO: composer.trainer.trainer: Setting seed to 24                                                                                         
INFO: composer.utils.reproducibility: Setting seed to 24                                                                                   
INFO: llmfoundry.command_utils.train: Logging config                                                                                       
INFO: llmfoundry.command_utils.train: Starting training...                                                                                 
INFO: composer.trainer.trainer: Using precision Precision.AMP_BF16                                                                         
DEBUG: composer.trainer.trainer: Spinning the dataloaders                                                                                  
DEBUG: composer.trainer.trainer: Starting training loop                                                                                    
INFO: streaming.base.dataset: Because `num_canonical_nodes` was not specified, and `shuffle_algo` is py1e, it will default to be equal to physical nodes.
INFO: streaming.base.dataset: Because `shuffle_block_size` was not specified, it will default to max(4_000_000 // num_canonical_nodes, 1 << 18) if num_canonical_nodes is not None, otherwise 262144.

Logs between batches:

[batch=15/4800]:                                                                               
         Train time/batch: 14                                                                  
         Train time/sample: 3584                                                               
         Train time/batch_in_epoch: 14                                                         
         Train time/sample_in_epoch: 3584                                                      
         Train time/token: 7340032                                                             
         Train time/token_in_epoch: 7340032                                                    
         Train memory/current_allocated_mem: 0.4178                
         Train memory/current_active_mem: 0.4178                                               
         Train memory/current_inactive_mem: 0.1841                                             
         Train memory/current_reserved_mem: 1.6840
         Train memory/peak_allocated_mem: 1.0126                                                                                                                                              
         Train memory/peak_active_mem: 1.0399                                                  
         Train memory/peak_inactive_mem: 0.6321                                                                                                                                               
         Train memory/peak_reserved_mem: 1.6840                                                
         Train memory/alloc_retries: 0                                                         
         Train trainer/device_train_microbatch_size: 0.5000                                    
         Train loss/train/total: 9.2921                                                        
         Train metrics/train/LanguageCrossEntropy: 9.2928                                      
         Train metrics/train/LanguagePerplexity: 10859.7646                                    
         Train metrics/train/TokenAccuracy: 0.0460                                             
         Train throughput/batches_per_sec: 0.1895                            
         Train throughput/samples_per_sec: 48.5081
         Train throughput/device/batches_per_sec: 0.0237                                       
         Train throughput/device/samples_per_sec: 6.0635                                       
         Train throughput/tokens_per_sec: 99344.5516                                           
         Train throughput/device/tokens_per_sec: 12418.0689                                    
         Train throughput/flops_per_sec: 96257336207502.8594                                                                                                                                  
         Train throughput/device/flops_per_sec: 12032167025937.8574                            
         Train throughput/device/mfu: 0.0122                                                                                                                                                  
         Train time/train: 0.0231                                                                                                                                                             
         Train time/val: 0.0000                                                                                                                                                               
         Train time/total: 0.0231                                                                                                                                                             
         Train lr-DecoupledAdamW/group0: 0.0001                                                                                                                                               
         Train time/remaining_estimate: 6.9695                                                                                                                                                
[batch=16/4800]:                                                                                                                                                                              
         Train time/batch: 15                                                                                                                                                                 
         Train time/sample: 3840                                                                                                                                                              
         Train time/batch_in_epoch: 15                                                                                                                                                        
         Train time/sample_in_epoch: 3840                                                                                                                                                     
         Train time/token: 7864320                                                                                                                                                            
         Train time/token_in_epoch: 7864320                                                                                                                                                   
         Train memory/current_allocated_mem: 0.4178                                                                                                                                           
         Train memory/current_active_mem: 0.4178                                                                                                                                              
         Train memory/current_inactive_mem: 0.1841                                             
         Train memory/current_reserved_mem: 1.6840
         Train memory/peak_allocated_mem: 1.0126                                               
         Train memory/peak_active_mem: 1.0399                                                  
         Train memory/peak_inactive_mem: 0.6321                                                
         Train memory/peak_reserved_mem: 1.6840                                                
         Train memory/alloc_retries: 0                                                         
         Train trainer/device_train_microbatch_size: 0.5000                                    
         Train loss/train/total: 9.1863              
         Train metrics/train/LanguageCrossEntropy: 9.1856                                                  
         Train metrics/train/LanguagePerplexity: 9755.9043                                                 
         Train metrics/train/TokenAccuracy: 0.0482                                                         
         Train throughput/batches_per_sec: 0.1901                                                                         
         Train throughput/samples_per_sec: 48.6691                                                                        
         Train throughput/device/batches_per_sec: 0.0238                                                                  
         Train throughput/device/samples_per_sec: 6.0836                                                                  
         Train throughput/tokens_per_sec: 99674.2310                                                                      
         Train throughput/device/tokens_per_sec: 12459.2789                                                               
         Train throughput/flops_per_sec: 96576770543371.9688                                                              
         Train throughput/device/flops_per_sec: 12072096317921.4961                                                       
         Train throughput/device/mfu: 0.0122                 
         Train time/train: 0.0246                            
         Train time/val: 0.0000                              
         Train time/total: 0.0246                            
         Train lr-DecoupledAdamW/group0: 0.0001                                                                           
         Train time/remaining_estimate: 6.9653                         
[batch=17/4800]:                                                       
         Train time/batch: 16                                          
         Train time/sample: 4096                                       
         Train time/batch_in_epoch: 16                                 
         Train time/sample_in_epoch: 4096                              
         Train time/token: 8388608                                     
         Train time/token_in_epoch: 8388608                            
         Train memory/current_allocated_mem: 0.4178                                                                                            
         Train memory/current_active_mem: 0.4178                                                                                               
         Train memory/current_inactive_mem: 0.1841                                                                                             
         Train memory/current_reserved_mem: 1.6840

I do not see any logs for deprecation. cc: @dakinggg (Are we good on this?)
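(As an aside, a quick way to double-check this locally, sketch only and not part of the PR: Python's default filters hide `DeprecationWarning` outside `__main__`, so escalating it to an error before launching training makes any remaining deprecated code path fail loudly.)

```python
# Sketch only (not part of this PR): surface any lingering DeprecationWarning.
import warnings

# Escalate DeprecationWarning to an exception; by default Python ignores it
# outside __main__, so a deprecation message could be silently filtered out.
warnings.simplefilter("error", DeprecationWarning)
# ...then run the same training entrypoint as in the logs above.
```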

@gupta-abhay gupta-abhay merged commit ce13961 into main Nov 19, 2024
9 checks passed
@gupta-abhay gupta-abhay deleted the abhay/deprecate_seqparallel_terms branch June 5, 2025 15:05
