
Commit 1af0941

levendlee authored and facebook-github-bot committed
Adds baseline implementation for MetaShuffling and some cleanups. (pytorch#4080)
Summary:
X-link: facebookresearch/FBGEMM#1164

Adds baseline implementation for MetaShuffling and some cleanups.

Differential Revision: D74101069
1 parent 6e9f1e0 commit 1af0941

File tree

5 files changed: +399 −177 lines


fbgemm_gpu/experimental/gen_ai/README.md (+1 −1)
@@ -59,7 +59,7 @@ pip install fbgemm-gpu-genai
 
 ## 2.2 **Llama4 MoE support**
 
-More coming soon in [token shuffling](gen_ai/moe/README.md) kernels.
+More coming soon in [MetaShuffling](gen_ai/moe/README.md) kernels.
 
 # 3. **Llama 3 Related External Coverage**

@@ -1,9 +1,15 @@
 # FBGEMM GenAI MoE Support
 
-MoE Token Shuffling Kernel support in FBGEMM GenAI Kernel Library.
+MetaShuffling MoE kernel support in FBGEMM GenAI kernel library.
 
-# **1. Overview**
+# **Overview**
 
-Mixture-of-Experts (MoE) is a popular model architecture for large language models (LLMs). Although it reduces computation in training and inference by activating less parameters per token, it imposes additional challenges in achieving optimal computation efficiency with high memory and communication pressure, as well as the complexity to handle the dynamism and sparsity nature of the model. Here we introduce a new MoE inference solution, token shuffling, which enables us to efficiently deploy Llama 4 models for real scenario inference.
+Mixture-of-Experts (MoE) is a popular model architecture for large language models (LLMs). Although it reduces computation in training and inference by activating less parameters per token, it imposes additional challenges in achieving optimal computation efficiency with high memory and communication pressure, as well as the complexity to handle the dynamism and sparsity nature of the model. Here we introduce a new MoE inference solution, MetaShuffling, which enables us to efficiently deploy Llama 4 models for real scenario inference.
 
 More technical design will be coming soon.
+
+# **Updates**
+
+- 2025-05-01: Initial version of MetaShuffling MoE pytorch example release.
+
+- 2025-04-17: Initial version of MetaShuffling MoE GPU kernels release.
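
For context (not part of this commit): a common way to handle MoE routing at inference time is to sort token-to-expert assignments so that each expert receives a contiguous block of tokens, which can then be processed as one dense GEMM per expert. The sketch below is a generic top-k routing and token-grouping example in PyTorch; all names (`route_and_shuffle`, the shapes, `top_k`) are hypothetical illustrations and do not reflect the FBGEMM MetaShuffling kernel API.

```python
import torch

def route_and_shuffle(tokens: torch.Tensor, router_logits: torch.Tensor, top_k: int = 1):
    """tokens: [T, D]; router_logits: [T, E]. Returns token copies grouped by expert."""
    # Pick the top_k experts per token (dense routing scores).
    scores, expert_ids = router_logits.topk(top_k, dim=-1)      # both [T, top_k]
    flat_experts = expert_ids.reshape(-1)                        # [T * top_k]
    # Stable sort by expert id so each expert's tokens become contiguous.
    order = torch.argsort(flat_experts, stable=True)
    token_ids = torch.arange(tokens.shape[0], device=tokens.device).repeat_interleave(top_k)
    shuffled_tokens = tokens[token_ids[order]]                   # [T * top_k, D]
    # Per-expert counts: how many rows each expert owns in the shuffled buffer.
    counts = torch.bincount(flat_experts, minlength=router_logits.shape[-1])
    # Gate weights reordered to match the shuffled rows.
    gates = torch.softmax(scores, dim=-1).reshape(-1)[order]
    return shuffled_tokens, counts, order, gates

# Tiny usage example: 8 tokens of width 16 routed across 4 experts, top-2 routing.
tokens = torch.randn(8, 16)
router_logits = torch.randn(8, 4)
shuffled, counts, order, gates = route_and_shuffle(tokens, router_logits, top_k=2)
print(counts.tolist(), shuffled.shape)  # per-expert row counts and the [16, 16] buffer
```

The stable sort preserves token order within each expert, and the per-expert counts are what a grouped or expert-batched GEMM would use to slice its contiguous inputs; the inverse of `order` can scatter expert outputs back to the original token positions.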
