First Version Release

@wenxie-amd wenxie-amd released this 13 Aug 07:59
· 1 commit to v0.1.0-rc1 since this release (9a98d01)

What's Changed

  • hipBLASLt auto-tune by @wenxie-amd in #23
  • fix(examples): fix Megatron path used in prepare_dataset by @Xiaoming-AMD in #24
  • fix(torch_fsdp): fix ddp_config retrieval failure when using torch_fsdp by @Xiaoming-AMD in #27
  • preflight by @wenxie-amd in #25
  • merge trace file by @wenxie-amd in #31
  • add inter-node ring p2p test by @limou102 in #30
  • feat(megatron): Align TFLOPs calculation for megatron by @Xiaoming-AMD in #28
  • [Fix] fp8 option not working by @RuibinCheung in #33
  • feat(HSA ENV): tune ROCm runtime with HSA_NO_SCRATCH_RECLAIM and HSA_ENABLE_SDMA by @Xiaoming-AMD in #32
  • feat(config parse): replace YAML values (int/float) from env('KEY') by @Xiaoming-AMD in #34
  • feat(fsdp): patch Megatron torch_FSDP2 with Primus implementation by @Xiaoming-AMD in #35
  • Add README by @wenxie-amd in #36
  • fix typo in preflight script by @wenxie-amd in #37
  • [Feat] Add tensile tuning example by @RuibinCheung in #38
  • refactor(examples): simplify usage and improve structure for clarity by @Xiaoming-AMD in #39
  • test: Add model-specific Megatron trainer test cases with isolated test config by @Xiaoming-AMD in #40
  • Primus benchmark by @xiaobochen-amd in #43
  • docs(contributing): add initial contributing guidelines by @Xiaoming-AMD in #42
  • Dev/yaoc/mixtral by @ChengYao-amd in #44
  • fix fast async checkpoint on ROCm by @limou102 in #46
  • docs & refactor: reorganize README, unify config usage, and improve Megatron pretrain scripts for Primus by @Xiaoming-AMD in #45
  • optimize: reduce FP8 training memory usage via scoped TE layer config overrides by @Xiaoming-AMD in #47
  • fix(megatron): add missing import for 'inspect' in TE kwargs patch by @Xiaoming-AMD in #49
  • chore(submodule): update Megatron-LM from 20250324 to 20250522 by @Xiaoming-AMD in #50
  • refactor: improve benchmark runner and report parser with multi-node support by @Xiaoming-AMD in #51
  • feature(model): add mixtral pretrain config by @ChengYao-amd in #52
  • update trace_moe_metric call to fit new megatron interface by @ChengYao-amd in #53
  • fix(Megatron): fix interleaved virtual pipeline training error and add corresponding UT by @lhzhang333 in #54
  • opt(UT): add num_workers=1 in UT yaml to save most of the time on exit by @lhzhang333 in #56
  • Update mixtral pretrain configs by @yuankaichen-amd in #55
  • refactor(docker): Update docker image to v25.5_py310 by @wenxie-amd in #57
  • feat(config): Update LLaMA pretrain configs by @Xiaoming-AMD in #58
  • feature(RDMA): Add filtering for GPU RDMA network adapters by @chaojhou in #59
  • fix(trainer-test): improve training script success detection using stdout … by @Xiaoming-AMD in #63
  • chore(license): add MIT LICENSE file for Primus by @Xiaoming-AMD in #61
  • refactor: move Megatron run scripts to examples root and add --backend parameter for multi-backend support by @Xiaoming-AMD in #64
  • feat(torchtitan): Add TorchTitan Backend Support (Initial Stub) by @Xiaoming-AMD in #65
  • feat(torchtitan): add --local-ranks-filter support in torchrun launcher by @Xiaoming-AMD in #67
  • fix(slurm): remove --reservation flag and quote variables in run_slurm_pretrain.sh by @Xiaoming-AMD in #68
  • feat(megatron): enable manual pipeline split in (interleaved) 1F1B-PP by monkey patching by @lhzhang333 in #69
  • rebase main to instella branch by @wenxie-amd in #71
  • fix(ip-interface): socket interface env regression by @Xiaoming-AMD in #70
  • feat: Add run_k8s_pretrain interface for Kubernetes workload submission by @Xiaoming-AMD in #72
  • feat(run_k8s_pretrain): support --workspace and improve job spec defaults by @Xiaoming-AMD in #73
  • feat(megatron): support mock_data mode to skip dataset preparation by @Xiaoming-AMD in #74
  • feat(k8s_pretrain): support logging to both stdout and a file by @chaojhou in #76
  • feat(k8s): Support for Node Selection via --nodelist and Add nodes by @Xiaoming-AMD in #75
  • docs: add TorchTitan backend support entry to README by @Xiaoming-AMD in #78
  • add benchmark for checkpoint saving by @limou102 in #81
  • feat(torchtitan): Add model configs for LLaMA3-405B and LLaMA3-70B (TorchTitan) by @Xiaoming-AMD in #82
  • feat(tp-overlap): add te backend and support tp overlap for megatron. by @zhenhuang12 in #79
  • feat(benchmark): update kernel benchmark and add llama405B config by @xiaobochen-amd in #77
  • llama3.1_405B model config by @wenxie-amd in #84
  • print training envs by @wenxie-amd in #85
  • add checkpoint loading metrics by @limou102 in #86
  • feat: add new ckpt args of megatron by @wenxie-amd in #88
  • doc: Add Mistral Models and Fix Formatting in examples/README.md by @Xiaoming-AMD in #87
  • refactor(cli): Enhance Primus CLI with --override Support & Simplify Platform Defaults by @Xiaoming-AMD in #89
  • chore(license): add AMD license headers by @Xiaoming-AMD in #90
  • feat(k8s launch): Support forwarding unrecognized --args to ENTRY_POINT by @Xiaoming-AMD in #91
  • fix(megatron): sync initialize_megatron of primus with that of megatron by @lhzhang333 in #93
  • enable deepseek qk_layernorm by @wenxie-amd in #94
  • checkout Primus-Turbo by github secret by @wenxie-amd in #96
  • feat(tp-overlap): support torchtitan by patch fused_all_gather_matmul of torch op by @zhenhuang12 in #92
  • add deprecated_20251209 moe layer by @wenxie-amd in #98
  • feat(megatron): add attn warmup to save iter1's time when pp is used by @lhzhang333 in #97
  • Primus Config/Patch Document by @wenxie-amd in #100
  • feat(megatron): enable dumping pp schedule data and add pp visualization tool by @lhzhang333 in #99
  • add patch readme for attn_warmup and decoder_pipeline_manual_split_list by @lhzhang333 in #101
  • feat(megatron): add model and pretrain config for LLaMA3.1-405B by @Xiaoming-AMD in #102
  • refactor: Refactor Torchtitan Config & Launch: YAML Unification, Backend Auto-Selection by @Xiaoming-AMD in #106
  • refactor(torchtitan): switch llama3 configs from TOML to YAML by @Xiaoming-AMD in #108
  • doc(examples): Rename Torchtitan LLaMA3 Configs to LLaMA3.1 and Update README Links by @Xiaoming-AMD in #110
  • Add tas k8s runner's ci file by @haishuok0525 in #109
  • test(megatron): add Mixtral-8x22B/Mixtral-8x7B test and TRAIN_LOG override support by @Xiaoming-AMD in #114
  • Speedup primus-turbo build in k8s-ci runner by @wenxie-amd in #113
  • fix(trainer): auto-enable tensorboard when profiling is enabled by @Xiaoming-AMD in #116
  • Code isolation from shared path by @haishuok0525 in #119
  • feat(megatron): add moe_use_fused_router_with_aux_score by @ChengYao-amd in #111
  • [UT] Add deterministic extra check and unit test by @RuibinCheung in #115
  • feat(turbo): Primus-Torchtitan support Primus-Turbo backend. by @xiaobochen-amd in #118
  • permute fusion and padded mla attention by @wenxie-amd in #120
  • Add deepseek v2 config by @wenxie-amd in #121
  • fix(megatron): Correct offset calculation when vpp degree is larger than 2 by @lhzhang333 in #122
  • update config for llama2-7b and llama3.1-8b to align test by @llying-001 in #123
  • fix(tests): fix async_tp tests for warning print. by @zhenhuang12 in #125
  • feat(megatron): mixtral & llama parallel tuning by @Xiaoming-AMD in #126
  • fix(megatron): sync pp related code in primus trainer with megatron by @lhzhang333 in #129
  • feat(tp-overlap): support te and torchtitan by patch fused_matmul_reduce_scatter of torch op by @llying-001 in #112
  • feat(turbo): Primus-Megatron support Primus-Turbo backend by @kyle-256 in #124
  • ci(install-turbo): update commit to remove triton_dist. by @zhenhuang12 in #130
  • Reduce trace file size in torch.profile by @wenxie-amd in #131
  • feat(offline_tune): support generate reports by @RuibinCheung in #133
  • feat(turbo): fit PrimusTurboAttention with cp by @ChengYao-amd in #134
  • fix(config): update gradient_accumulation_fusion and moe_use_legacy_grouped_gemm comment by @RuibinCheung in #135
  • feat(async_tp): support fp8 for all_gather + matmul. by @zhenhuang12 in #132
  • add NCCL_IB_HCA, CLEAN_DOCKER_CONTAINER by @wenxie-amd in #138
  • add 515B/1T/2T/4T proxy model; update launcher by @wenxie-amd in #150
  • PR for primus/megatron v25.7 release by @vidushi8 in #145

Full Changelog: https://github.com/AMD-AIG-AIMA/Primus/commits/v0.1.0-alpha