Releases: microsoft/ltp-megatron-lm

Release ltp-megatron-lm v0.0.5

11 Jul 03:07
9c0df79

ltp-megatron-lm v0.0.5 Release Notes

CI/CD

  1. Fix failures in tests/unit_tests/test_checkpointing.py (#51)
  2. Fix unit test failures caused by torch.load API change (#59)
  3. Recover single-session tests (#64)
  4. Adjust enabled and disabled tests (#69)

Framework Features

MoE

  1. Fix all-reduce for global-batch load balancing loss (#68)

Release ltp-megatron-lm v0.0.4

26 Jun 17:45
5653fb6

ltp-megatron-lm v0.0.4 Release Notes

Framework Features

CI/CD

  1. Recover test_errors_are_reported test case in dist_checkpointing for AMD (#57)

Logging

  1. Fix per-layer grad norm logging for VPP (#55)

Release ltp-megatron-lm v0.0.3

20 Jun 08:14
2ced5cf

ltp-megatron-lm v0.0.3 Release Notes

Framework Features

CI/CD

  1. Reduce resource usage by NCCL during unit tests (#45)
  2. Fix bug in #45 (#47)

Checkpoint

  1. Fix wrong offset in grouped gemm for dist checkpoint (#29)
  2. Support dist checkpoint upload (#38)

Logging

  1. Add per-layer grad norm logs (#33)
  2. Log MoE token metrics for all kinds of aux losses (#41)
  3. Fix bug in #41 (#54)

Release ltp-megatron-lm v0.0.2

13 Jun 16:48
4a292ed

ltp-megatron-lm v0.0.2 Release Notes

Framework Features

CI/CD

  1. Initiate CI/CD pipeline for unit tests (#37 and #47)

Checkpoint

  1. Disable non-blocking when saving dist checkpoint on AMD GPUs (#28)

Release ltp-megatron-lm v0.0.1

30 May 02:00
3279307

ltp-megatron-lm v0.0.1 Release Notes

Framework Features

Checkpoint

  1. Fix checkpoint conversion when using async save (#10)
  2. Support triggering manual GC after checkpoint (#12)
  3. Upload checkpoints to Azure Blob (#19)
  4. Recalculate rampup batch size and data offset (#20)
  5. Support isolated checkpoint saving (#24)
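
For readers unfamiliar with the "manual GC after checkpoint" item above, the general idea can be sketched as follows. This is a minimal illustration only, not the code from #12: the helper name `save_then_collect` and its `manual_gc` parameter are hypothetical, and the only real API used is Python's standard `gc.collect()`.

```python
import gc


def save_then_collect(save_fn, *args, manual_gc=True, **kwargs):
    """Run a checkpoint-saving callable, then optionally force a GC pass.

    `save_fn` stands in for whatever function writes the checkpoint;
    the name and the `manual_gc` flag are assumptions for illustration.
    """
    result = save_fn(*args, **kwargs)
    if manual_gc:
        # Collect right after the save so that temporary objects created
        # while saving are reclaimed at a predictable point.
        gc.collect()
    return result
```

Triggering collection explicitly puts the pause at a known point (right after the save) rather than at an arbitrary moment during a later training step.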

Dataloader

  1. Improve dataset weighted blending (#15)

Logging

  1. Add customized wandb logs (#11)
  2. Add global batch token per expert in wandb (#18)
  3. Add MoE global batch loss metrics (#21)

Others

  1. Disable fused kernel building for ROCm (#8)
  2. Remove redundant grad stats when --log-num-zeros-in-grad is not enabled (#9)

Model Support

Algorithm

  1. Add cross entropy label smoothing (#16)
  2. Support normal distribution initialization for output layers (#22)
  3. Add Kaiming init option for MoE router weights (#25)
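
As background for the label smoothing item above, here is a minimal pure-Python sketch of one standard formulation, in which the target distribution mixes the one-hot label with a uniform distribution over classes. The function name and the `smoothing` parameter are illustrative assumptions; the variant implemented in #16 may differ in detail (for example, in how the smoothing mass is spread over non-target classes).

```python
import math


def label_smoothed_cross_entropy(logits, target, smoothing=0.1):
    """Cross entropy for one sample with uniform label smoothing.

    loss = (1 - smoothing) * NLL(target) + smoothing * mean_k(-log p_k)
    """
    # Numerically stable log-softmax.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    log_probs = [x - log_z for x in logits]

    nll = -log_probs[target]                        # usual cross entropy term
    uniform = -sum(log_probs) / len(log_probs)      # cross entropy vs. uniform
    return (1.0 - smoothing) * nll + smoothing * uniform
```

With `smoothing=0.0` this reduces to ordinary cross entropy; with a positive value, a confidently correct prediction is penalized slightly, which discourages over-confident logits.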

MoE

  1. Add global batch load balancing loss (#7)
  2. Support fine-grained recompute for MoE layer (#13)
  3. Support gradient scale and normalization for MoE router (#17)
  4. Add option to use different score function for aux loss (#26)
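
To illustrate what a "global batch" load balancing loss can mean, the sketch below uses the common Switch-Transformer-style auxiliary loss, `num_experts * sum_i(f_i * P_i)`, and compares computing it per micro-batch against computing it on counts aggregated over all micro-batches. This is an assumption about the general technique, not the implementation in #7: the function names are hypothetical, and the element-wise sum stands in for the all-reduce that a distributed run would perform.

```python
def load_balancing_loss(expert_counts, mean_probs):
    """Switch-style aux loss: num_experts * sum_i f_i * P_i.

    expert_counts[i]: number of tokens routed to expert i
    mean_probs[i]:    mean router probability assigned to expert i
    """
    num_experts = len(expert_counts)
    total = sum(expert_counts)
    fractions = [c / total for c in expert_counts]
    return num_experts * sum(f * p for f, p in zip(fractions, mean_probs))


def aggregate_counts(per_microbatch_counts):
    """Element-wise sum over micro-batches (stand-in for an all-reduce)."""
    return [sum(col) for col in zip(*per_microbatch_counts)]


# Two micro-batches, each routed entirely to a different expert.
counts = [[4, 0], [0, 4]]
probs = [[0.9, 0.1], [0.1, 0.9]]

per_microbatch = [load_balancing_loss(c, p) for c, p in zip(counts, probs)]

global_counts = aggregate_counts(counts)
# Equal-sized micro-batches, so the global mean is the mean of the means.
global_probs = [sum(col) / len(col) for col in zip(*probs)]
global_loss = load_balancing_loss(global_counts, global_probs)
```

In this toy case each micro-batch looks imbalanced on its own, while the aggregated statistics are perfectly balanced, so the globally computed loss is lower than either per-micro-batch value.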

Optimizer

  1. Fix the learning rate not being overridden when --override-opt_param-scheduler is set (#14)

Tokenizer

  1. Add option to allow trust_remote_code for HuggingFace Tokenizer (#23)

Documentation & Repo

Documentation

  1. Add Microsoft SECURITY.MD (#2 and #5)
  2. Add License (#4)

Repo

  1. Initiate code owners (#30)