Releases · AMD-AGI/Primus · GitHub

18 Oct 06:40

wenxie-amd

v0.4.0 Pre-release

What's Changed

fix(config): correct flavor to 405B in torchtitan/llama3.1_405B.yaml by @Xiaoming-AMD in #189
perf(torchtitan/config): enable compile for Llama-3.1 (8B/70B/405B) by @Xiaoming-AMD in #193
disable dump_pp_data when pp size is one by @lhzhang333 in #191
remove turbo token by @wenxie-amd in #197
feat(async-tp) change gemm_rs_overlap api for multi-stream method by @llying-001 in #171
Support for torchtitan with Primus-Turbo by @clairesonglee in #188
chore: update default rocm/megatron-lm image to v25.8_py310 by @Xiaoming-AMD in #198
perf(aiter): add AITER_JIT_DIR env for cached build to speed up re-compilation by @Xiaoming-AMD in #199
feat: align primus-turbo fp8 linear's args to megatron by @RuibinCheung in #195
Add wandb_enable config and Torchtitan unit tests by @zitree in #194
fix: wrapper turbo quant config in megatron extension by @RuibinCheung in #202
feat(cli): add Python-based primus entrypoint for PATH installation by @Xiaoming-AMD in #200
feat(zero-bubble): support zero bubble pipeline parallelism by @ChengYao-amd in #208
Primus product matrix by @wenxie-amd in #210
fix: remove MXQuantConfig from titan and add warning msg by @RuibinCheung in #212
fix 8B perf regression (v25.9) by @wenxie-amd in #215
feat(zero-bubble): support GroupGemm wgrad split, add debug_scheduler_table flag by @ChengYao-amd in #213
add support for grok1 by @JohnQinAMD in #216
improve torch profiling by @wenxie-amd in #218
supports: userId for request by @weilei0120 in #214
support mlflow tracking by @wenxie-amd in #219
feat: Update Megatron-LM to 8477817(20251011) by @Xiaoming-AMD in #221
test(megatron): add Qwen2.5-7B and Qwen2.5-72B pretrain cases by @Xiaoming-AMD in #222
feat(CLI): add unified shell entry scripts for Slurm, container, and direct modes by @Xiaoming-AMD in #209
Add tensor size print for comm op benchmark by @lorri-rao in #223
fix(megatron): fix bugs for fitting the newest megatron by @ChengYao-amd in #224
Docker Release v25.9 by @wenxie-amd in #217
Add grok2 model support by @wenxie-amd in #227
Use PRIMUS_xxx env, export all envs for slurm by @wenxie-amd in #229
feat(deepep): add PrimusTurboDeepEPTokenDispatcher and support syncfree moe stage 0-2 by @zhenhuang12 in #220
upgrade(torchtitan): sync torchtitan to 5fb7cc2e3bbb9b9dc0ab7af34ed5cc58b5f32021 (2025-10-16) by @Xiaoming-AMD in #228
chore(docker): update default image to rocm/primus:v25.9_gfx942 by @Xiaoming-AMD in #230
fix(tests): add missing expecttest dependency for distributed tests by @Xiaoming-AMD in #233
fix(config): use 1.0e-2 for moe_aux_loss_coeff to ensure correct float parsing by @Xiaoming-AMD in #234
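
The last item above (#234) reflects a common YAML pitfall: loaders that follow the YAML 1.1 float rules (PyYAML is one such loader) only treat a scalar as a float when the mantissa contains a dot and the exponent carries an explicit sign, so `1e-2` is read as a string while `1.0e-2` is read as a number. The sketch below is an illustration only. It assumes a PyYAML-style resolver and reimplements a simplified version of its decimal float pattern with the standard library; the names `YAML11_DECIMAL_FLOAT` and `load_scalar` are hypothetical and are not part of Primus.

```python
import re

# Simplified decimal-float pattern modeled on the YAML 1.1 implicit resolver
# (assumption: the config loader follows these rules, as PyYAML does).
# The sexagesimal, .inf and .nan forms are left out for brevity.
YAML11_DECIMAL_FLOAT = re.compile(
    r"""^(?:[-+]?[0-9][0-9_]*\.[0-9_]*(?:[eE][-+][0-9]+)?
         |\.[0-9][0-9_]*(?:[eE][-+][0-9]+)?)$""",
    re.X,
)


def load_scalar(scalar: str):
    """Return a float when the pattern matches, otherwise keep the string."""
    if YAML11_DECIMAL_FLOAT.match(scalar):
        # YAML allows underscores as digit separators; float() does not.
        return float(scalar.replace("_", ""))
    return scalar


if __name__ == "__main__":
    for text in ("1e-2", "1.0e-2"):
        value = load_scalar(text)
        print(f"{text!r} -> {type(value).__name__}")
```

Under these assumptions, writing the coefficient with a dot in the mantissa is what lets the value arrive as a float instead of a string.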

New Contributors

@clairesonglee made their first contribution in #188
@zitree made their first contribution in #194
@lorri-rao made their first contribution in #223

Full Changelog: v0.2.0...v0.4.0

Contributors

clairesonglee, lorri-rao, and 10 other contributors

15 Oct 00:47

wenxie-amd

v0.3.0 Latest

What's Changed

fix(config): correct flavor to 405B in torchtitan/llama3.1_405B.yaml by @Xiaoming-AMD in #189
perf(torchtitan/config): enable compile for Llama-3.1 (8B/70B/405B) by @Xiaoming-AMD in #193
disable dump_pp_data when pp size is one by @lhzhang333 in #191
remove turbo token by @wenxie-amd in #197
feat(async-tp) change gemm_rs_overlap api for multi-stream method by @llying-001 in #171
Support for torchtitan with Primus-Turbo by @clairesonglee in #188
chore: update default rocm/megatron-lm image to v25.8_py310 by @Xiaoming-AMD in #198
perf(aiter): add AITER_JIT_DIR env for cached build to speed up re-compilation by @Xiaoming-AMD in #199
feat: align primus-turbo fp8 linear's args to megatron by @RuibinCheung in #195
Add wandb_enable config and Torchtitan unit tests by @zitree in #194
fix: wrapper turbo quant config in megatron extension by @RuibinCheung in #202
feat(cli): add Python-based primus entrypoint for PATH installation by @Xiaoming-AMD in #200
feat(zero-bubble): support zero bubble pipeline parallelism by @ChengYao-amd in #208
Primus product matrix by @wenxie-amd in #210
fix: remove MXQuantConfig from titan and add warning msg by @RuibinCheung in #212
fix 8B perf regression (v25.9) by @wenxie-amd in #215
feat(zero-bubble): support GroupGemm wgrad split, add debug_scheduler_table flag by @ChengYao-amd in #213

New Contributors

@clairesonglee made their first contribution in #188
@zitree made their first contribution in #194

Full Changelog: v0.2.0...v0.3.0

Contributors

clairesonglee, wenxie-amd, and 6 other contributors

11 Sep 05:00

wenxie-amd

v0.2.0

What's Changed

feat: Unify config/backend CLI & add config export support by @Xiaoming-AMD in #151
feat: reduce cpu sync of moe_router_force_load_balancing by @RuibinCheung in #153
feat(light-megatron): add LightMegatronPretrainTrainer with clean config-based integration by @Xiaoming-AMD in #136
fix(docker): Use docker_podman_proxy for container cleanup by @Xiaoming-AMD in #157
feat(moe): fused moe router add scatter logics, modify flags to primus_turbo.yaml by @ChengYao-amd in #141
feat(turbo): update turbo grouped gemm bf16/fp16 by @xiaobochen-amd in #149
fix(pp): fix the validation issue when vpp is not set in manual split mode by @lhzhang333 in #161
Add initial llama4 configs by @chriscai-amd in #163
(ut) add megatron ut scripts by @llying-001 in #164
refactor(attn): update attention utils interface by @ChengYao-amd in #159
Update Llama-4-Scout-17B-16E Megatron Configs by @chriscai-amd in #165
update log/wandb/tensorboard by @wenxie-amd in #169
[Llama4] Add Llama4 17B128E Maverick config by @chriscai-amd in #172
feat(turbo): attn interface fit turbo by @ChengYao-amd in #173
turn on manual gc by @wenxie-amd in #175
add userid to header by @weilei0120 in #177
(feat) async tp: adapt async-tp for te2.x api by @llying-001 in #178
[Perf Issue] Disable manual_gc by default and update rocm_mem behavior by @wenxie-amd in #179
update proxy model config by @wenxie-amd in #167
upgrade docker image by @wenxie-amd in #176
Enable turbo v25.8 by @vidushi8 in #180
fix wandb/tensorboard mem item by @wenxie-amd in #181
(test) add torchtitan ut and integration test by @llying-001 in #170
add te fused cross entropy argument by @wenxie-amd in #182
make pp_data_dir configurable and add pp_vis dependencies by @lhzhang333 in #183
pp_warmup optimization by @lhzhang333 in #185
move clean step into UT by @wenxie-amd in #186

New Contributors

@chriscai-amd made their first contribution in #163
@weilei0120 made their first contribution in #177

Full Changelog: v0.1.0-rc1...v0.2.0

Contributors

vidushi8, wenxie-amd, and 8 other contributors

13 Aug 07:59

wenxie-amd

First Version Release

What's Changed

hipblaslt auto tune by @wenxie-amd in #23
fix(examples): fix Megatron path used in prepare_dataset by @Xiaoming-AMD in #24
fix(torch_fsdp): get ddp_config failed when use torch_fsdp by @Xiaoming-AMD in #27
preflight by @wenxie-amd in #25
merge trace file by @wenxie-amd in #31
add inter-node ring p2p test by @limou102 in #30
feat(megatron): Align TFLOPs calculation for megatron by @Xiaoming-AMD in #28
[Fix] fp8 option not work by @RuibinCheung in #33
feat(HSA ENV): tune ROCm runtime with HSA_NO_SCRATCH_RECLAIM and HSA_ENABLE_SDMA by @Xiaoming-AMD in #32
feat(config parse): replace yaml value(int/float ) from env('KEY') by @Xiaoming-AMD in #34
feat(fsdp): patch Megatron torch_FSDP2 with Primus implementation by @Xiaoming-AMD in #35
Add README by @wenxie-amd in #36
fix typo error of preflight script by @wenxie-amd in #37
[Feat] Add tensile tuning example by @RuibinCheung in #38
refactor(examples): simplify usage and improve structure for clarity by @Xiaoming-AMD in #39
test:Add model-specific Megatron trainer test cases with isolated test config by @Xiaoming-AMD in #40
Primus benchmark by @xiaobochen-amd in #43
docs(contributing): add initial contributing guidelines by @Xiaoming-AMD in #42
Dev/yaoc/mixtral by @ChengYao-amd in #44
fix fast async checkpoint on ROCm by @limou102 in #46
docs & refactor: reorganize README, unify config usage, and improve Megatron pretrain scripts for Primus by @Xiaoming-AMD in #45
optimize: reduce FP8 training memory usage via scoped TE layer config overrides by @Xiaoming-AMD in #47
fix(megatron): add missing import for 'inspect' in TE kwargs patch by @Xiaoming-AMD in #49
chore(submodule): update Megatron-LM from 20250324 to 20250522 by @Xiaoming-AMD in #50
refactor: improve benchmark runner and report parser with multi-node support by @Xiaoming-AMD in #51
feature(model): add mixtral pretrain config by @ChengYao-amd in #52
update trace_moe_metric call to fit new megatron interface by @ChengYao-amd in #53
fix(Megatron): fix interleaved virtual pipeline training error and add corresponding UT by @lhzhang333 in #54
opt(UT): add num_workers=1 in UT yaml to save most of the time on exit by @lhzhang333 in #56
Update mixtral pretrain configs by @yuankaichen-amd in #55
refactor(docker): Update docker image to v25.5_py310 by @wenxie-amd in #57
feat(config): Update LLaMA pretrain configs by @Xiaoming-AMD in #58
feature(RDMA): Add filtering for gpu RDMA network adapters by @chaojhou in #59
fix(trainer-test): improve training script success detection using stdout … by @Xiaoming-AMD in #63
chore(license): add MIT LICENSE file for Primus by @Xiaoming-AMD in #61
refactor: move Megatron run scripts to examples root and add --backend parameter for multi-backend support by @Xiaoming-AMD in #64
feat(torchtitan): Add TorchTitan Backend Support (Initial Stub) by @Xiaoming-AMD in #65
feat(torchtitan): add --local-ranks-filter support in torchrun launcher by @Xiaoming-AMD in #67
fix(slurm): remove --reservation flag and quote variables in run_slurm_pretrain.sh by @Xiaoming-AMD in #68
feat(megatron): enable manual pipeline split in (interleaved) 1F1B-PP by monkey patching by @lhzhang333 in #69
rebase main to instella branch by @wenxie-amd in #71
fix(ip-interface): socket interface env regression by @Xiaoming-AMD in #70
feat: Add run_k8s_pretrain interface for Kubernetes workload submission by @Xiaoming-AMD in #72
feat(run_k8s_pretrain): support --workspace and improve job spec defaults by @Xiaoming-AMD in #73
feat(megatron): support mock_data mode to skip dataset preparation by @Xiaoming-AMD in #74
feat(k8s_pretrain): support log in stdout and file by @chaojhou in #76
feat(k8s): Support for Node Selection via --nodelist and Add nodes by @Xiaoming-AMD in #75
docs: add TorchTitan backend support entry to README by @Xiaoming-AMD in #78
add benchmark for checkpoint saving by @limou102 in #81
feat(torchtitan):Add model configs for LLaMA3-405B and LLaMA3-70B (TorchTitan) by @Xiaoming-AMD in #82
feat(tp-overlap): add te backend and support tp overlap for megatron. by @zhenhuang12 in #79
feat(benchmark): update kernel benchmark and add llama405B config by @xiaobochen-amd in #77
llama3.1_405B model config by @wenxie-amd in #84
print training envs by @wenxie-amd in #85
add checkpoint loading metrics by @limou102 in #86
feat: add new ckpt args of megatron by @wenxie-amd in #88
doc: Add Mistral Models and Fix Formatting in examples/README.md by @Xiaoming-AMD in #87
refactor(cli): Enhance Primus CLI with --override Support & Simplify Platform Defaults by @Xiaoming-AMD in #89
chore(license): add AMD license headers by @Xiaoming-AMD in #90
feat(k8s launch):Support forwarding unrecognized --args to ENTRY_POINT by @Xiaoming-AMD in #91
fix(megatron): sync initialize_megatron of primus with that of megatron by @lhzhang333 in #93
enable deepseek qk_layernorm by @wenxie-amd in #94
checkout Primus-Turbo by github secret by @wenxie-amd in #96
feat(tp-overlap): support torchtitan by patch fused_all_gather_matmul of torch op by @zhenhuang12 in #92
add deprecated_20251209 moe layer by @wenxie-amd in #98
feat(megatron): add attn warmup to save iter1's time when pp is used by @lhzhang333 in #97
Primus Config/Patch Document by @wenxie-amd in #100
feat(megatron): enable dumping pp schedule data and add pp visualization tool by @lhzhang333 in #99
add patch readme for attn_warmup and decoder_pipeline_manual_split_list by @lhzhang333 in #101
feat(megatron): add model and pretrain config for LLaMA3.1-405B by @Xiaoming-AMD in #102
refactor: Refactor Torchtitan Config & Launch: YAML Unification, Backend Auto-Selection by @Xiaoming-AMD in #106
refactor(torchtitan): switch llama3 configs from TOML to YAML by @Xiaoming-AMD in #108
doc(examples): Rename Torchtitan LLaMA3 Configs to LLaMA3.1 and Update README Links by @Xiaoming-AMD in #110
Add tas k8s runner's ci file by @haishuok0525 in #109
test(megatron): add Mixtral-8x22B/Mixtral-8x7B test and TRAIN_LOG override support by @Xiaoming-AMD in #114
Speedup primus-turbo build in k8s-ci runner by @wenxie-amd in #113
fix(trainer): auto-enable tensorboard when profiling is enabled by @Xiaoming-AMD in #116
Code isolation from shared path by @haishuok0525 in #119
feat(megatron): add moe_use_fused_router_with_aux_score by @ChengYao-amd in #111
[UT] Add deterministic extra check and unit test by @RuibinCheung in h...

Contributors

vidushi8, wenxie-amd, and 12 other contributors
