Releases: microsoft/ltp-megatron-lm
Release ltp-megatron-lm v0.0.5
CI/CD
- Fix failures in tests/unit_tests/test_checkpointing.py (#51)
- Fix unit test failures caused by a torch.load API change (#59)
- Recover single-session tests (#64)
- Adjust enabled and disabled tests (#69)
Framework Features
MoE
- Fix all-reduce for global-batch load balancing loss (#68)
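The fix in #68 concerns aggregating load-balancing statistics across data-parallel ranks so the loss reflects the global batch. A minimal sketch of the idea in pure Python, with the all-reduce simulated as an element-wise mean; the function name is illustrative, not the repo's API:

```python
def all_reduce_mean(per_rank_stats):
    """Simulate an averaging all-reduce over per-rank statistics.

    Each inner list holds one rank's per-expert values (e.g. token counts);
    the result is the element-wise mean, i.e. what every rank would see
    after the all-reduce, so the balancing loss uses global-batch counts
    rather than each rank's local micro-batch counts.
    """
    num_ranks = len(per_rank_stats)
    return [sum(col) / num_ranks for col in zip(*per_rank_stats)]
```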
Release ltp-megatron-lm v0.0.4
Release ltp-megatron-lm v0.0.3
Framework Features
CI/CD
Checkpoint
Logging
Release ltp-megatron-lm v0.0.2
Release ltp-megatron-lm v0.0.1
Framework Features
Checkpoint
- Fix checkpoint convert when using async save (#10)
- Support triggering manual GC after checkpoint (#12)
- Upload checkpoints to Azure Blob (#19)
- Recalculate rampup batch size and data offset (#20)
- Support isolated checkpoint saving (#24)
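For #12, the idea behind triggering manual GC after a checkpoint is to run a full garbage collection at a controlled point, so collection pauses land right after the save rather than inside a later training step. A hedged sketch; `save_fn` and the flag name are illustrative, not the repo's API:

```python
import gc

def save_checkpoint(save_fn, manual_gc=True):
    # Write the checkpoint first.
    save_fn()
    if manual_gc:
        # Force a full collection now, at a step boundary, so GC pauses
        # don't fire unpredictably mid-iteration later on.
        gc.collect()
```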
Dataloader
- Improve dataset weighted blending (#15)
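Weighted blending (#15) ultimately reduces to turning per-dataset weights into per-dataset sample counts that sum exactly to the requested total. A minimal largest-remainder sketch under that assumption — not the repo's actual implementation:

```python
def blended_sample_counts(weights, num_samples):
    # Give each dataset floor(w_i / W * n) samples, then hand the remaining
    # samples to the datasets with the largest fractional parts, so the
    # counts always sum exactly to num_samples.
    total = sum(weights)
    raw = [w / total * num_samples for w in weights]
    counts = [int(x) for x in raw]
    remainder = num_samples - sum(counts)
    by_fraction = sorted(range(len(weights)),
                         key=lambda i: raw[i] - counts[i], reverse=True)
    for i in by_fraction[:remainder]:
        counts[i] += 1
    return counts
```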
Logging
- Add customized wandb logs (#11)
- Add global-batch tokens-per-expert metric to wandb (#18)
- Add MoE global batch loss metrics (#21)
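The tokens-per-expert metric in #18 is essentially a histogram of router assignments over the global batch. A minimal illustration; names are hypothetical:

```python
from collections import Counter

def tokens_per_expert(assignments, num_experts):
    # Count how many tokens the router sent to each expert; experts that
    # received no tokens report 0, which is exactly the imbalance signal
    # such a metric is meant to surface.
    counts = Counter(assignments)
    return [counts.get(e, 0) for e in range(num_experts)]
```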
Others
- Disable fused kernel building for ROCm (#8)
- Remove redundant grad stats when --log-num-zeros-in-grad is not enabled (#9)
Model Support
Algorithm
- Add cross entropy label smoothing (#16)
- Support normal distribution initialization for output layers (#22)
- Add Kaiming init option for MoE router weights (#25)
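Label smoothing (#16) replaces the one-hot target with a mixture of the one-hot vector and a uniform distribution before taking cross entropy. A minimal per-token sketch of the math in pure Python — the repo would apply this inside its vocab-parallel cross entropy, so this is illustrative only:

```python
import math

def label_smoothed_ce(log_probs, target, eps):
    # Smoothed target: eps is spread uniformly over all K classes, and the
    # gold class additionally keeps weight (1 - eps), so it ends up with
    # 1 - eps + eps/K. With eps = 0 this reduces to plain NLL.
    k = len(log_probs)
    uniform = eps / k
    loss = 0.0
    for i, lp in enumerate(log_probs):
        weight = uniform + ((1.0 - eps) if i == target else 0.0)
        loss -= weight * lp
    return loss
```

With a uniform predicted distribution the loss is log K for any eps, since the smoothed target weights always sum to one.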
MoE
- Add global batch load balancing loss (#7)
- Support fine-grained recompute for MoE layer (#13)
- Support gradient scale and normalization for MoE router (#17)
- Add option to use different score function for aux loss (#26)
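Several of these items (#7, #17, #26) revolve around the auxiliary load-balancing loss. A minimal top-1 sketch of the standard formulation L = K · Σ_e f_e · P_e, where f_e is the fraction of tokens routed to expert e and P_e the mean router probability for e; the global-batch variant computes f and P over the whole global batch instead of a local micro-batch. Pure Python, illustrative names:

```python
def load_balancing_loss(router_probs, assignments, num_experts):
    # router_probs: per-token probability vectors over the experts
    # assignments:  top-1 expert index chosen for each token
    n = len(router_probs)
    frac = [0.0] * num_experts   # f_e: fraction of tokens sent to expert e
    prob = [0.0] * num_experts   # P_e: mean router probability of expert e
    for p, e in zip(router_probs, assignments):
        frac[e] += 1.0 / n
        for j in range(num_experts):
            prob[j] += p[j] / n
    # Scaled by num_experts so a perfectly balanced router scores 1.0.
    return num_experts * sum(f * q for f, q in zip(frac, prob))
```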
Optimizer
- Fix learning rate not being overridden when --override-opt_param-scheduler is set (#14)
Tokenizer
- Add option to allow trust_remote_code for HuggingFace Tokenizer (#23)
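The switch in #23 would be plumbed through to Hugging Face's `AutoTokenizer.from_pretrained(..., trust_remote_code=...)`, which is required for tokenizers that ship custom code. A sketch of the flag plumbing only; the flag name here is hypothetical and the actual one in #23 may differ:

```python
import argparse

parser = argparse.ArgumentParser()
# Hypothetical flag name. It would be forwarded as
#   AutoTokenizer.from_pretrained(name, trust_remote_code=args.trust_remote_code)
# Default stays False, since trusting remote code executes it locally.
parser.add_argument("--trust-remote-code", action="store_true")

args = parser.parse_args(["--trust-remote-code"])
```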
Documentation & Repo
Documentation
Repo
- Initiate code owners (#30)