Multinode projection with different parallelization strategies when a single node is benchmarked #492
araina-amd wants to merge 25 commits into main
Conversation
araina-amd commented on Jan 14, 2026
- Multinode scaling projection from baseline to target node count
- Automatic config reduction for single-node benchmarking (PP and EP rescaling; see the sketch after this list)
- Integration with pipeline simulation for accurate baseline calculation
- Per-layer communication estimation (TP AllReduce, MoE All-to-All)
- Detailed communication breakdown with message sizes
- Support for overlapped gradient all-reduce (default enabled)
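For illustration, here is a minimal sketch of the config-reduction and communication-estimation ideas listed above. The helper names, the halve-PP-then-EP rule, and the ring cost model are assumptions of this sketch, not the PR's actual API.

```python
from dataclasses import dataclass


@dataclass
class ParallelConfig:
    tp: int  # tensor parallel size
    pp: int  # pipeline parallel size
    ep: int  # expert parallel size
    dp: int  # data parallel size


def reduce_to_single_node(cfg: ParallelConfig, gpus_per_node: int = 8) -> ParallelConfig:
    """Rescale PP and EP so the benchmark config fits on one node.

    Keep TP fixed (usually intra-node), halve PP and then EP until
    tp * pp * ep <= gpus_per_node, and fill the remaining GPUs with data parallelism.
    """
    pp, ep = cfg.pp, cfg.ep
    while cfg.tp * pp * ep > gpus_per_node:
        if pp > 1:
            pp = max(1, pp // 2)   # shrink pipeline stages first
        elif ep > 1:
            ep = max(1, ep // 2)   # then shrink expert parallelism
        else:
            raise ValueError("config cannot be reduced to a single node")
    dp = gpus_per_node // (cfg.tp * pp * ep)
    return ParallelConfig(tp=cfg.tp, pp=pp, ep=ep, dp=dp)


def ring_allreduce_bytes_per_gpu(message_bytes: int, group_size: int) -> float:
    """Per-GPU traffic of a ring all-reduce: 2 * (N - 1) / N of the message size.

    Used here as a stand-in for the per-layer TP AllReduce estimate; a MoE
    All-to-All moves roughly (N - 1) / N of the routed token bytes instead.
    """
    return 2.0 * (group_size - 1) / group_size * message_bytes
```

In this sketch, e.g. PP=4, EP=16 reduces to PP=1, EP=8 on an 8-GPU node; a multinode projection would then re-apply the original PP/EP sizes and add the estimated inter-node communication on top of the single-node baseline.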
Let's separate the style/formatting changes from the actual changes and split this into two PRs (if the formatting is actually necessary).
As for the actual changes, there are several key things missing from the code; let's discuss offline.
Force-pushed from 53259c7 to 23ea7c6
LGTM in general, I left some comments in the code as well as below:
Thanks, Anshu! I have some minor comments: (1) please fix the bot's findings -- mostly for unused variables;
…nce_projection
- Delete primus/core/projection/multinode_projection/ directory
- All multinode projection functionality is now in performance_projection/projection.py
- Communication calculation, hardware config loading, and projection logic consolidated
…he benchmarked time.
…y accounted in the pipeline simulation model.
…a_parallel_size to use PROJECTION_NNODES, fixed wgrad double-counting (set to 0.0), removed wgrad additions for IO layers, and added zero-bubble scheduler support with 50/50 B/W split when enable_zero_bubble=True.
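The 50/50 B/W split mentioned above can be sketched as follows; the even split ratio comes from the commit message, while the function name and signature are hypothetical.

```python
def split_backward_for_zero_bubble(backward_ms: float, b_fraction: float = 0.5):
    """Split a measured backward time into B (input-grad) and W (weight-grad) phases.

    Zero-bubble schedulers defer the W phase into pipeline bubbles, so the two
    phases must be modeled separately instead of as a single backward block.
    """
    b_ms = backward_ms * b_fraction          # input-gradient (B) portion
    w_ms = backward_ms * (1.0 - b_fraction)  # weight-gradient (W) portion
    return b_ms, w_ms
```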
…d _run_pipeline_simulation_megatron_zb() to use the actual Megatron zero-bubble scheduler (ILP-based) instead of the simple heuristic scheduler. Add custom_hardware_example.yaml for hardware configuration. Also fix some prints.
Usage: bash runner/primus-cli direct --script primus/cli/main.py -- projection performance --config examples/megatron/configs/MI300X/deepseek_v2_lite-BF16-pretrain.yaml --target-nodes 6
Projection accuracy for DeepSeek V2 Lite:
- PP=3, EP=8 (3 nodes): Projected 6628ms vs Measured 6468ms = +2.5% error
- PP=1, EP=16 (2 nodes): Projected 5337ms vs Measured 5276ms = +1.2% error
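For reference, the reported error percentages are just the relative difference between projected and measured step times:

```python
def projection_error_pct(projected_ms: float, measured_ms: float) -> float:
    """Relative projection error in percent (positive = over-projection)."""
    return (projected_ms - measured_ms) / measured_ms * 100.0


# Reproduces the numbers above:
#   projection_error_pct(6628, 6468)  -> ~+2.5
#   projection_error_pct(5337, 5276)  -> ~+1.2
```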
Force-pushed from 45b1952 to 5d6ac43
- Fix import spacing (add blank lines after imports)
- Fix string quotes (single to double quotes)
- Fix trailing whitespace
- Fix function spacing (add blank lines between functions)
- Format all affected files to pass CI black check
LGTM. @wenxie-amd, can you please give it a review?
… and fix total_gpus calculation in collective args.
…ble FSDP2 for benchmarking, and print tokens/s/GPU and summary at the end
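The tokens/s/GPU figure printed in the summary follows the usual throughput formula; a minimal sketch (the helper name and signature are hypothetical):

```python
def tokens_per_sec_per_gpu(global_batch_size: int, seq_length: int,
                           step_time_s: float, world_size: int) -> float:
    """Tokens processed in one step, divided by step time and GPU count."""
    return global_batch_size * seq_length / (step_time_s * world_size)
```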
…tion
- Add FSDP2 communication calculation (all-gather forward, reduce-scatter backward)
  * Calculate per-layer weight sharding communication overhead
  * Account for FSDP overlap efficiency (~70% for multi-node)
  * Expose communication at layer boundaries (first/last layer)
- Add recomputation overhead for gradient checkpointing
  * Support recompute_granularity='full' with recompute_num_layers
  * Track forward times separately for dense and MoE layers
  * Add recomputation overhead to baseline time calculation
- Improve gradient all-reduce handling
  * Only calculate gradient all-reduce when NOT using FSDP2
  * FSDP2 uses reduce-scatter instead of all-reduce for gradients
  * Maintain backward compatibility with distributed_optimizer (ZeRO-1)
- Add training config fields for overlap and recomputation
  * overlap_grad_reduce, overlap_param_gather
  * recompute_granularity, recompute_num_layers
- Add edge case handling for microbatch calculation
  * Warn when global_batch_size < micro_batch × target_DP
  * Handle zero microbatch case gracefully
- Improve output formatting
  * Show forward/backward breakdown in layer timing
  * Display recomputation overhead when enabled
  * Better FSDP communication reporting
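A rough sketch of the exposed FSDP2 per-layer communication estimate described in the list above. The ~70% overlap efficiency is taken from the commit message; the function name, the ring cost model, and the bandwidth parameter are assumptions of this sketch.

```python
def fsdp2_layer_comm_ms(
    layer_param_bytes: int,
    dp_world_size: int,
    bus_bw_gb_per_s: float,
    overlap_efficiency: float = 0.7,  # ~70% overlap for multi-node, per the commit message
) -> float:
    """Estimate the exposed per-layer FSDP2 communication time in milliseconds.

    Forward all-gathers the sharded weights and backward reduce-scatters the
    gradients; a ring implementation moves roughly (N - 1) / N of the full
    parameter size per GPU for each collective.
    """
    ring_fraction = (dp_world_size - 1) / dp_world_size
    total_bytes = 2 * ring_fraction * layer_param_bytes  # all-gather (fwd) + reduce-scatter (bwd)
    raw_ms = total_bytes / (bus_bw_gb_per_s * 1e9) * 1e3
    return raw_ms * (1.0 - overlap_efficiency)  # only the non-overlapped part is exposed
```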
…config updates
Changes:
1. language_model.py:
- Support imbalanced layer distribution when the layer count is not divisible by PP*VPP
  (e.g., 61 layers distributed as [16,15,15,15] for 4 stages; see the sketch after this list)
- Add _get_balanced_layer_distribution() for automatic remainder distribution
- Add _get_explicit_layer_distribution() for decoder_first/last_pipeline_num_layers
- Fix recompute activation memory calculation
2. attention.py:
- Fix MLA activation memory calculation with proper query_projection_size
3. training_config.py:
- Fix moe_layer_freq parsing when it's a string integer (e.g., '1')
- Use model's padded_vocab_size instead of hardcoded 100352
4. deepseek_v3.yaml:
- Enable multi_latent_attention: true
- Add num_shared_experts: 1
- Add padded_vocab_size: 163840
5. qwen3_235B_A22B.yaml:
- Add num_shared_experts: 1
- Add moe_shared_expert_intermediate_size: 1536
- Add padded_vocab_size: 151936
6. Add new model configs: kimi_k2.yaml (1039B MoE+MLA) and glm4_7.yaml (345B MoE+GQA)
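A minimal sketch of the balanced layer distribution from item 1. The helper names appear in the commit messages, but these bodies and signatures are assumptions of this sketch; the remainder-to-front rule reproduces the [16, 15, 15, 15] example.

```python
from typing import List


def _get_balanced_layer_distribution(num_layers: int, num_stages: int) -> List[int]:
    """Spread layers across pipeline stages, giving the remainder to the earliest
    stages, e.g. 61 layers over 4 stages -> [16, 15, 15, 15]."""
    base, remainder = divmod(num_layers, num_stages)
    return [base + 1 if stage < remainder else base for stage in range(num_stages)]


def get_layers_for_rank(rank: int, distribution: List[int]) -> range:
    """Global layer indices owned by a pipeline rank under an imbalanced split."""
    start = sum(distribution[:rank])
    return range(start, start + distribution[rank])


assert _get_balanced_layer_distribution(61, 4) == [16, 15, 15, 15]
assert list(get_layers_for_rank(1, [16, 15, 15, 15])) == list(range(16, 31))
```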
…istribution
- Add `from __future__ import annotations` to parser.py for older Python compatibility
- Change `tuple[...]` to `Tuple[...]` type hint syntax
- Add `_get_balanced_layer_distribution()` for even layer distribution across PP stages
- Add `_get_explicit_layer_distribution()` to support decoder_first/last_pipeline_num_layers
- Update `get_layers_for_rank()` to handle imbalanced PP configurations