[Common] Improved fused MoE aux loss kernel for large # of experts by denera · Pull Request #2758 · NVIDIA/TransformerEngine

denera · 2026-03-13T11:19:38Z

Description

Eliminates the expensive cluster management API and minimizes the number of atomic operations to improve the performance of the fused MoE aux loss kernel when the number of experts is large.
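To illustrate the atomic-op part of this change, here is a minimal CPU-side sketch in plain C++. It is an analogy, not the kernel in this PR. It assumes a row-major `[num_tokens x num_experts]` probability matrix and compares issuing one atomic update per element against accumulating each block's rows locally and issuing one atomic update per expert per block. The names `ReductionResult`, `atomic_add`, `reduce_naive`, `reduce_blocked`, and `rows_per_block` are hypothetical, and the sketch does not model the cluster API at all.

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <vector>

struct ReductionResult {
  std::vector<double> column_sums;  // one accumulated value per expert
  std::size_t atomic_ops;           // number of atomic updates issued
};

// Atomic add for double via compare-and-swap (works before C++20).
inline void atomic_add(std::atomic<double>& target, double value) {
  double expected = target.load();
  while (!target.compare_exchange_weak(expected, expected + value)) {
    // `expected` is refreshed on failure, so simply retry.
  }
}

// Copies the atomic accumulators into a plain result struct.
inline ReductionResult collect(const std::vector<std::atomic<double>>& global,
                               std::size_t ops) {
  ReductionResult result{std::vector<double>(global.size()), ops};
  for (std::size_t e = 0; e < global.size(); ++e) {
    result.column_sums[e] = global[e].load();
  }
  return result;
}

// Baseline: every matrix element triggers its own atomic update.
ReductionResult reduce_naive(const std::vector<double>& probs,
                             std::size_t num_tokens, std::size_t num_experts) {
  std::vector<std::atomic<double>> global(num_experts);
  for (auto& g : global) g.store(0.0);
  std::size_t ops = 0;
  for (std::size_t t = 0; t < num_tokens; ++t) {
    for (std::size_t e = 0; e < num_experts; ++e) {
      atomic_add(global[e], probs[t * num_experts + e]);
      ++ops;
    }
  }
  return collect(global, ops);
}

// Two-level reduction: each "block" of rows accumulates into a private buffer
// (standing in for per-block scratch memory), then publishes one atomic update
// per expert. Blocks run sequentially here; rows_per_block must be > 0.
ReductionResult reduce_blocked(const std::vector<double>& probs,
                               std::size_t num_tokens, std::size_t num_experts,
                               std::size_t rows_per_block) {
  std::vector<std::atomic<double>> global(num_experts);
  for (auto& g : global) g.store(0.0);
  std::size_t ops = 0;
  for (std::size_t start = 0; start < num_tokens; start += rows_per_block) {
    const std::size_t end = std::min(start + rows_per_block, num_tokens);
    std::vector<double> local(num_experts, 0.0);
    for (std::size_t t = start; t < end; ++t) {
      for (std::size_t e = 0; e < num_experts; ++e) {
        local[e] += probs[t * num_experts + e];
      }
    }
    for (std::size_t e = 0; e < num_experts; ++e) {
      atomic_add(global[e], local[e]);
      ++ops;
    }
  }
  return collect(global, ops);
}
```

In this sketch the naive version issues `num_tokens * num_experts` atomic updates, while the blocked version issues `num_experts` updates per block, so the saving grows with the number of rows each block handles.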

TODO: Perf testing on all archs.

Documentation change (change only to the documentation, either a fix or a new content)
Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Infra/Build change
Code refactoring

Signed-off-by: Alp Dener <adener@nvidia.com>

added new implementation of fused_moe_aux_loss_forward kernel

8b866e3

Signed-off-by: Alp Dener <adener@nvidia.com>

denera self-assigned this Mar 13, 2026

denera added the 2.15 label Mar 13, 2026

[pre-commit.ci] auto fixes from pre-commit.com hooks

1071b6b

for more information, see https://pre-commit.ci