[TRACKER] Customer support related PR tracker for Intel devices #6556
This issue acts as a tracker for Intel customer support related PRs. The purpose is to build an understanding of what each PR does and how important it is relative to the other customer support related PRs, and to help us keep track of which PRs have been merged and how the remaining ones are progressing.
Under review
- sequence parallel for uneven heads #6392 (Open)
- Enabled configurable auto Tensor Parallelism (TP) for the inference of diverse models #6553 (Open)
Already merged
- MoE
  - support bf16_optimizer moe expert parallel training and moe EP grad_scale/grad_norm fix #5259
  - Fix a convergence issues in TP topology caused by incorrect grad_norm. #5411
  - add moe topk(k>2) gate support #5881
  - reduce cpu host overhead when using moe #5578
- Ulysses
  - fix sequence parallel(Ulysses) grad scale for zero0 #5555
  - sequence parallel with communication overlap #5691
- AutoTP
  - autoTP for fused qkv weight #3844
  - autoTP for Qwen #4902
  - Enabled Qwen2-MoE Tensor Parallelism (TP) inference #6551 (Open)
- Accelerator Graph
  - Capture short kernel sequences to graph #4318
- ZeRO
  - params partition for skip_init #4722
- Others
  - skip bcast when enable pp but pp_group_size=1 #3915
  - remove duplicate check for pp and zero stage #4033
  - update ut/doc for glm/codegen #4057
  - do allgather only in shared optimizer states groups #4167
  - use `non_reentrant_checkpoint` fix requires_grad of input must be true for activation checkpoint layer in pipeline train. #4224
  - clear redundant parameters in zero3 bwd hook #4520
  - set the default to use set_to_none for clearing gradients in BF16 optimizer. #5434
  - Use `deepspeed.comm` instead of `torch.distributed` #5225 (see the sketch after this list)
  - Use `torch.nan_to_num` replace numpy wrapper one #5877
  - [bugfix] promote state in bf16_optimizer #5767
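
A note on the `deepspeed.comm` item (#5225): `deepspeed.comm` is DeepSpeed's communication wrapper and mirrors the `torch.distributed` call signatures, so the migration is mostly an import swap. Below is a minimal sketch under that assumption; the `allreduce_gradient` helper is illustrative and not taken from the PR.

```python
import torch
# Before the migration this would have been `import torch.distributed as dist`;
# deepspeed.comm exposes the same collectives (all_reduce, ReduceOp, ...).
import deepspeed.comm as dist


def allreduce_gradient(grad: torch.Tensor) -> torch.Tensor:
    # Illustrative helper (not from the PR): sum-reduce a gradient tensor
    # across all ranks through the deepspeed.comm wrapper.
    dist.all_reduce(grad, op=dist.ReduceOp.SUM)
    return grad
```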