Multinode projection with different parallelization strategies when a single node is benchmarked #492
araina-amd wants to merge 25 commits into main
Conversation
araina-amd commented on Jan 14, 2026
- Multinode scaling projection from baseline to target node count
- Automatic config reduction for single-node benchmarking (PP and EP rescaling; see the sketch after this list)
- Integration with pipeline simulation for accurate baseline calculation
- Per-layer communication estimation (TP AllReduce, MoE All-to-All)
- Detailed communication breakdown with message sizes
- Support for overlapped gradient all-reduce (default enabled)
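For illustration, here is a minimal sketch of the config-reduction and communication-estimation ideas listed above. The helper names, the halve-PP-then-EP rule, and the ring cost model are assumptions of this sketch, not the PR's actual API.

```python
from dataclasses import dataclass


@dataclass
class ParallelConfig:
    tp: int  # tensor parallel size
    pp: int  # pipeline parallel size
    ep: int  # expert parallel size
    dp: int  # data parallel size


def reduce_to_single_node(cfg: ParallelConfig, gpus_per_node: int = 8) -> ParallelConfig:
    """Rescale PP and EP so the benchmark config fits on one node.

    Keep TP fixed (usually intra-node), halve PP and then EP until
    tp * pp * ep <= gpus_per_node, and fill the remaining GPUs with data parallelism.
    """
    pp, ep = cfg.pp, cfg.ep
    while cfg.tp * pp * ep > gpus_per_node:
        if pp > 1:
            pp = max(1, pp // 2)   # shrink pipeline stages first
        elif ep > 1:
            ep = max(1, ep // 2)   # then shrink expert parallelism
        else:
            raise ValueError("config cannot be reduced to a single node")
    dp = gpus_per_node // (cfg.tp * pp * ep)
    return ParallelConfig(tp=cfg.tp, pp=pp, ep=ep, dp=dp)


def ring_allreduce_bytes_per_gpu(message_bytes: int, group_size: int) -> float:
    """Per-GPU traffic of a ring all-reduce: 2 * (N - 1) / N of the message size.

    Used here as a stand-in for the per-layer TP AllReduce estimate; a MoE
    All-to-All moves roughly (N - 1) / N of the routed token bytes instead.
    """
    return 2.0 * (group_size - 1) / group_size * message_bytes
```

In this sketch, e.g. PP=4, EP=16 reduces to PP=1, EP=8 on an 8-GPU node; a multinode projection would then re-apply the original PP/EP sizes and add the estimated inter-node communication on top of the single-node baseline.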
Let's separate the style/formatting changes from the actual changes and split this into two PRs (if the formatting is actually necessary).
As for the actual changes, there are several key things missing from the code; let's discuss offline.
Force-pushed from 53259c7 to 23ea7c6
LGTM in general, I left some comments in the code as well as below:
Thanks, Anshu! I have some minor comments: (1) please fix the bot's findings -- mostly for unused variables;
…nce_projection
- Delete primus/core/projection/multinode_projection/ directory
- All multinode projection functionality is now in performance_projection/projection.py
- Communication calculation, hardware config loading, and projection logic consolidated
…he benchmarked time.
…y accounted in the pipeline simulation model.
…a_parallel_size to use PROJECTION_NNODES, fixed wgrad double-counting (set to 0.0), removed wgrad additions for IO layers, and added zero-bubble scheduler support with 50/50 B/W split when enable_zero_bubble=True.
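The 50/50 B/W split mentioned above can be sketched as follows; the even split ratio comes from the commit message, while the function name and signature are hypothetical.

```python
def split_backward_for_zero_bubble(backward_ms: float, b_fraction: float = 0.5):
    """Split a measured backward time into B (input-grad) and W (weight-grad) phases.

    Zero-bubble schedulers defer the W phase into pipeline bubbles, so the two
    phases must be modeled separately instead of as a single backward block.
    """
    b_ms = backward_ms * b_fraction          # input-gradient (B) portion
    w_ms = backward_ms * (1.0 - b_fraction)  # weight-gradient (W) portion
    return b_ms, w_ms
```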
…d _run_pipeline_simulation_megatron_zb() to use the actual Megatron zero-bubble scheduler (ILP-based) instead of the simple heuristic scheduler. Add custom_hardware_example.yaml for hardware configuration. Also fix some prints.
Usage: bash runner/primus-cli direct --script primus/cli/main.py -- projection performance --config examples/megatron/configs/MI300X/deepseek_v2_lite-BF16-pretrain.yaml --target-nodes 6
Projection accuracy for DeepSeek V2 Lite:
- PP=3, EP=8 (3 nodes): Projected 6628ms vs Measured 6468ms = +2.5% error
- PP=1, EP=16 (2 nodes): Projected 5337ms vs Measured 5276ms = +1.2% error
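For reference, the reported error percentages are just the relative difference between projected and measured step times:

```python
def projection_error_pct(projected_ms: float, measured_ms: float) -> float:
    """Relative projection error in percent (positive = over-projection)."""
    return (projected_ms - measured_ms) / measured_ms * 100.0


# Reproduces the numbers above:
#   projection_error_pct(6628, 6468)  -> ~+2.5
#   projection_error_pct(5337, 5276)  -> ~+1.2
```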
Force-pushed from 45b1952 to 5d6ac43
- Fix import spacing (add blank lines after imports)
- Fix string quotes (single to double quotes)
- Fix trailing whitespace
- Fix function spacing (add blank lines between functions)
- Format all affected files to pass CI black check
LGTM. @wenxie-amd, can you please give it a review?
… and fix total_gpus calculation in collective args.
…ble FSDP2 for benchmarking, and print tokens/s/GPU and summary at the end
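The tokens/s/GPU figure printed in the summary follows the usual throughput formula; a minimal sketch (the helper name and signature are hypothetical):

```python
def tokens_per_sec_per_gpu(global_batch_size: int, seq_length: int,
                           step_time_s: float, world_size: int) -> float:
    """Tokens processed in one step, divided by step time and GPU count."""
    return global_batch_size * seq_length / (step_time_s * world_size)
```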
…tion
- Add FSDP2 communication calculation (all-gather forward, reduce-scatter backward)
  * Calculate per-layer weight sharding communication overhead
  * Account for FSDP overlap efficiency (~70% for multi-node)
  * Expose communication at layer boundaries (first/last layer)
- Add recomputation overhead for gradient checkpointing
  * Support recompute_granularity='full' with recompute_num_layers
  * Track forward times separately for dense and MoE layers
  * Add recomputation overhead to baseline time calculation
- Improve gradient all-reduce handling
  * Only calculate gradient all-reduce when NOT using FSDP2
  * FSDP2 uses reduce-scatter instead of all-reduce for gradients
  * Maintain backward compatibility with distributed_optimizer (ZeRO-1)
- Add training config fields for overlap and recomputation
  * overlap_grad_reduce, overlap_param_gather
  * recompute_granularity, recompute_num_layers
- Add edge case handling for microbatch calculation
  * Warn when global_batch_size < micro_batch × target_DP
  * Handle zero microbatch case gracefully
- Improve output formatting
  * Show forward/backward breakdown in layer timing
  * Display recomputation overhead when enabled
  * Better FSDP communication reporting
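A rough sketch of the exposed FSDP2 per-layer communication estimate described in the list above. The ~70% overlap efficiency is taken from the commit message; the function name, the ring cost model, and the bandwidth parameter are assumptions of this sketch.

```python
def fsdp2_layer_comm_ms(
    layer_param_bytes: int,
    dp_world_size: int,
    bus_bw_gb_per_s: float,
    overlap_efficiency: float = 0.7,  # ~70% overlap for multi-node, per the commit message
) -> float:
    """Estimate the exposed per-layer FSDP2 communication time in milliseconds.

    Forward all-gathers the sharded weights and backward reduce-scatters the
    gradients; a ring implementation moves roughly (N - 1) / N of the full
    parameter size per GPU for each collective.
    """
    ring_fraction = (dp_world_size - 1) / dp_world_size
    total_bytes = 2 * ring_fraction * layer_param_bytes  # all-gather (fwd) + reduce-scatter (bwd)
    raw_ms = total_bytes / (bus_bw_gb_per_s * 1e9) * 1e3
    return raw_ms * (1.0 - overlap_efficiency)  # only the non-overlapped part is exposed
```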
…config updates
Changes:
1. language_model.py:
- Support imbalanced layer distribution when the layer count is not divisible by PP*VPP
  (e.g., 61 layers distributed as [16,15,15,15] for 4 stages; see the sketch after this list)
- Add _get_balanced_layer_distribution() for automatic remainder distribution
- Add _get_explicit_layer_distribution() for decoder_first/last_pipeline_num_layers
- Fix recompute activation memory calculation
2. attention.py:
- Fix MLA activation memory calculation with proper query_projection_size
3. training_config.py:
- Fix moe_layer_freq parsing when it's a string integer (e.g., '1')
- Use model's padded_vocab_size instead of hardcoded 100352
4. deepseek_v3.yaml:
- Enable multi_latent_attention: true
- Add num_shared_experts: 1
- Add padded_vocab_size: 163840
5. qwen3_235B_A22B.yaml:
- Add num_shared_experts: 1
- Add moe_shared_expert_intermediate_size: 1536
- Add padded_vocab_size: 151936
6. Add new model configs: kimi_k2.yaml (1039B MoE+MLA) and glm4_7.yaml (345B MoE+GQA)
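A minimal sketch of the balanced layer distribution from item 1. The helper names appear in the commit messages, but these bodies and signatures are assumptions of this sketch; the remainder-to-front rule reproduces the [16, 15, 15, 15] example.

```python
from typing import List


def _get_balanced_layer_distribution(num_layers: int, num_stages: int) -> List[int]:
    """Spread layers across pipeline stages, giving the remainder to the earliest
    stages, e.g. 61 layers over 4 stages -> [16, 15, 15, 15]."""
    base, remainder = divmod(num_layers, num_stages)
    return [base + 1 if stage < remainder else base for stage in range(num_stages)]


def get_layers_for_rank(rank: int, distribution: List[int]) -> range:
    """Global layer indices owned by a pipeline rank under an imbalanced split."""
    start = sum(distribution[:rank])
    return range(start, start + distribution[rank])


assert _get_balanced_layer_distribution(61, 4) == [16, 15, 15, 15]
assert list(get_layers_for_rank(1, [16, 15, 15, 15])) == list(range(16, 31))
```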
…istribution
- Add `from __future__ import annotations` to parser.py for older Python compatibility
- Change `tuple[...]` to `Tuple[...]` type hint syntax
- Add `_get_balanced_layer_distribution()` for even layer distribution across PP stages
- Add `_get_explicit_layer_distribution()` to support decoder_first/last_pipeline_num_layers
- Update `get_layers_for_rank()` to handle imbalanced PP configurations