11 Jul 03:07

yzygitzh

Release ltp-megatron-lm v0.0.5 Latest

ltp-megatron-lm v0.0.5 Release Notes

CI/CD

Fix failures in tests/unit_tests/test_checkpointing.py (#51)
Fix unit test failures caused by the torch.load API change (#59)
Recover single-session tests (#64)
Adjust enabled and disabled tests (#69)

Framework Features

MoE

Fix all-reduce for global-batch load balancing loss (#68)

26 Jun 17:45

yzygitzh

Release ltp-megatron-lm v0.0.4

ltp-megatron-lm v0.0.4 Release Notes

Framework Features

CI/CD

Recover test_errors_are_reported test case in dist_checkpointing for AMD (#57)

Logging

Fix per-layer grad norm logging for VPP (#55)

20 Jun 08:14

yzygitzh

Release ltp-megatron-lm v0.0.3

ltp-megatron-lm v0.0.3 Release Notes

Framework Features

CI/CD

Reduce resource usage by NCCL during unit tests (#45)
Fix bug in #45 (#47)

Checkpoint

Fix wrong offset in grouped gemm for dist checkpoint (#29)
Support dist checkpoint upload (#38)

Logging

Add per-layer grad norm logs (#33)
Log MoE token metrics for all kinds of aux losses (#41)
Fix bug in #41 (#54)

13 Jun 16:48

yzygitzh

Release ltp-megatron-lm v0.0.2

ltp-megatron-lm v0.0.2 Release Notes

Framework Features

CI/CD

Initiate CI/CD pipeline for unit tests (#37 and #47)

Checkpoint

Disable non-blocking when saving dist checkpoint on AMD GPUs (#28)

30 May 02:00

yzygitzh

Release ltp-megatron-lm v0.0.1

ltp-megatron-lm v0.0.1 Release Notes

Framework Features

Checkpoint

Fix checkpoint conversion when using async save (#10)
Support triggering manual GC after checkpoint (#12)
Upload checkpoints to Azure Blob (#19)
Recalculate rampup batch size and data offset (#20)
Support isolated checkpoint saving (#24)

Dataloader

Improve dataset weighted blending (#15)
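
For readers unfamiliar with weighted blending, the following is a minimal sketch of one common approach: each sample slot goes to whichever dataset is currently furthest below its target share. The function name, the example weights, and the greedy rule itself are assumptions made for illustration; the notes above do not say what #15 changed or how this repository implements blending.

```python
def build_blending_indices(weights, num_samples):
    """Greedily assign each sample slot to the dataset furthest below its target share."""
    total = sum(weights)
    weights = [w / total for w in weights]  # normalize so the weights sum to 1
    counts = [0] * len(weights)             # samples drawn from each dataset so far
    dataset_index = []                      # which dataset each sample slot uses
    for step in range(num_samples):
        target_total = max(step, 1)
        # error[i]: how far dataset i is below its target count at this step
        errors = [w * target_total - c for w, c in zip(weights, counts)]
        chosen = max(range(len(weights)), key=lambda i: errors[i])
        dataset_index.append(chosen)
        counts[chosen] += 1
    return dataset_index, counts


# Hypothetical blend of three datasets with weights 0.7 / 0.2 / 0.1.
indices, counts = build_blending_indices([0.7, 0.2, 0.1], 1000)
print(counts)
```

The per-dataset counts track the requested weights closely at every prefix of the sequence, not only at the end, which is why a greedy rule is often preferred over independent random draws.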

Logging

Add customized wandb logs (#11)
Add global batch tokens per expert to wandb (#18)
Add MoE global batch loss metrics (#21)

Others

Disable fused kernel building for ROCm (#8)
Remove redundant grad stats when --log-num-zeros-in-grad is not enabled (#9)

Model Support

Algorithm

Add cross entropy label smoothing (#16)
Support normal distribution initialization for output layers (#22)
Add Kaiming init option for MoE router weights (#25)
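
As a rough illustration of the label smoothing item (#16), the sketch below computes cross entropy against a target that mixes the one-hot label with a uniform distribution. This is the widely used textbook formulation, written in plain Python; the function name and the `smoothing` parameter are hypothetical, and the repository's implementation may scale the smoothing term differently.

```python
import math


def smoothed_cross_entropy(logits, target, smoothing=0.1):
    """Cross entropy against (1 - smoothing) * one_hot + smoothing * uniform."""
    # Log-softmax with max subtraction for numerical stability.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    log_probs = [x - log_z for x in logits]
    nll = -log_probs[target]                  # standard cross entropy term
    uniform = -sum(log_probs) / len(logits)   # cross entropy against uniform
    return (1.0 - smoothing) * nll + smoothing * uniform


logits = [2.0, 0.5, -1.0]
loss_plain = smoothed_cross_entropy(logits, target=0, smoothing=0.0)
loss_smooth = smoothed_cross_entropy(logits, target=0, smoothing=0.1)
print(loss_plain, loss_smooth)
```

With `smoothing=0.0` the function reduces to ordinary cross entropy. With a positive value, a confident correct prediction is penalized slightly, which discourages the model from driving logits to extremes.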

MoE

Add global batch load balancing loss (#7)
Support fine-grained recompute for MoE layer (#13)
Support gradient scale and normalization for MoE router (#17)
Add option to use different score function for aux loss (#26)
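
To make the difference between a per-micro-batch and a global-batch load balancing loss concrete, here is a minimal sketch using the Switch-style auxiliary loss, `num_experts * sum_i(f_i * P_i)`, where `f_i` is the fraction of tokens routed to expert `i` and `P_i` is the mean router probability for expert `i`. The formula choice, function name, and toy data are assumptions for illustration; the notes do not spell out the exact formulation used in #7, and the real implementation would aggregate statistics across ranks rather than concatenating Python lists.

```python
def load_balancing_loss(router_probs, expert_choices, num_experts):
    """Switch-style auxiliary loss: num_experts * sum_i(f_i * P_i)."""
    num_tokens = len(expert_choices)
    # f_i: fraction of tokens dispatched to expert i
    f = [0.0] * num_experts
    for e in expert_choices:
        f[e] += 1.0 / num_tokens
    # P_i: mean router probability assigned to expert i
    p = [sum(row[i] for row in router_probs) / num_tokens
         for i in range(num_experts)]
    return num_experts * sum(fi * pi for fi, pi in zip(f, p))


# Two toy micro-batches, each skewed toward a different expert.
probs_a, choices_a = [[0.9, 0.1], [0.9, 0.1]], [0, 0]
probs_b, choices_b = [[0.1, 0.9], [0.1, 0.9]], [1, 1]

loss_a = load_balancing_loss(probs_a, choices_a, 2)
loss_b = load_balancing_loss(probs_b, choices_b, 2)
# Global-batch variant: compute the statistics over all tokens together.
loss_global = load_balancing_loss(probs_a + probs_b, choices_a + choices_b, 2)
print(loss_a, loss_b, loss_global)
```

Each micro-batch looks imbalanced on its own, yet the combined batch is perfectly balanced and reaches the minimum value of 1. Computing the loss over the global batch therefore avoids penalizing routing that is only locally skewed.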

Optimizer

Fix the learning rate not being overridden when using --override-opt_param-scheduler (#14)

Tokenizer

Add option to allow trust_remote_code for HuggingFace Tokenizer (#23)

Documentation & Repo

Documentation

Add Microsoft SECURITY.MD (#2 and #5)
Add License (#4)

Repo

Initiate code owners (#30)
