Problem Statement
Currently, DeepSeek-V3 and GPT-OSS maintain separate MoE implementations with significant architectural differences:
DeepSeek-V3:
- MoEGate router with score correction bias and expert scaling
- SharedExpert runs in parallel with MoE experts
- All-gather/reduce-scatter for tensor parallelism at the decoder block level
- All-to-all dispatch/combine operations embedded within MoE module
GPT-OSS:
- TopKRouter with simple linear + topk + softmax logic
- Two expert modes: standard (sparse matmul) and throughput (all-to-all)
- No shared expert component
- Different collective operation patterns
This separation causes:
- Code duplication for similar MoE functionality
- Inability to test different configurations easily
- Maintenance burden for two separate implementations
- Difficulty comparing architectures side-by-side
- No clear path to support other MoE models (Mixtral, Kimi K2, Switch Transformer, etc.)
Proposed Solution
Create a single MoEBlock module, configured via JSON, that supports both the DeepSeek-V3 and GPT-OSS architectures and is extensible to other MoE models. This is a focused scope reduction from the full tt-moe library proposed in #37354.
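For illustration, a minimal sketch of the intended call pattern. MoEBlock and its from_json constructor do not exist yet; the names, arguments, and paths below are placeholders, and mesh_device / hidden_states are assumed to come from the surrounding model code.

# Sketch only: MoEBlock.from_json is a hypothetical constructor.
from models.demos.unified_moe.moe_block import MoEBlock

# The same class serves every model; only the JSON file changes.
deepseek_moe = MoEBlock.from_json("configs/deepseek_v3.json", mesh_device=mesh_device)
gpt_oss_moe = MoEBlock.from_json("configs/gpt_oss.json", mesh_device=mesh_device)

out = deepseek_moe(hidden_states)  # identical forward signature for both models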
Architecture
models/demos/unified_moe/
├── moe_block.py # Main unified MoE block
├── configs/
│ ├── deepseek_v3.json # DeepSeek-V3 configuration
│ ├── gpt_oss.json # GPT-OSS configuration
│ ├── mixtral.json # Mixtral 8x7B configuration
│ ├── switch_transformer.json # Switch Transformer config
│ ├── arctic.json # Arctic residual MoE config
│ └── kimi_k2.json # Hypothetical Kimi K2 config
├── components/
│ ├── routers/
│ │ ├── base_router.py
│ │ ├── topk_router.py # GPT-OSS style
│ │ ├── moe_gate.py # DeepSeek style with corrections
│ │ ├── switch_router.py # Top-1 with capacity
│ │ └── soft_router.py # Soft routing (all experts)
│ └── experts/
│ ├── base_expert.py
│ ├── distributed_expert.py # With all-to-all
│ ├── standard_expert.py # Sparse matmul
│ ├── shared_expert.py # DeepSeek shared expert
│ ├── residual_expert.py # Arctic-style with dense backbone
│ └── geglu_expert.py # GeGLU activation variant
└── tests/
├── test_moe_block.py
├── test_routers.py
├── test_experts.py
└── test_configs.py
Key Design Principles
- Simplified Tensor Parallelism: TP always follows the pattern all-gather → compute → reduce-scatter; only the cluster axis needs configuration (see the sketch after this list).
- Direct TTNN Calls: All collective operations call ttnn directly, with no abstraction layers.
- JSON-Driven Configuration: All architectural choices are controlled via JSON files.
- Extensible Design: Easy to add new router types, expert architectures, and models.
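A minimal sketch of that fixed TP pattern on a 2D mesh, where cluster_axis selects the tensor-parallel dimension. The exact ttnn CCL signatures (keyword names such as cluster_axis, mesh_device, math_op) differ between ttnn versions, so treat the calls below as indicative rather than authoritative.

import ttnn

def tp_forward(x, moe_compute, mesh_device, cluster_axis=1):
    # 1) Gather the sharded hidden dimension from every device on the TP axis.
    #    dim=3 assumes the usual [1, 1, seq_len, hidden] activation layout.
    x = ttnn.all_gather(x, dim=3, cluster_axis=cluster_axis, mesh_device=mesh_device)

    # 2) Run the MoE body (router + experts) on the gathered activation.
    x = moe_compute(x)

    # 3) Reduce partial expert outputs and scatter them back along the same axis.
    x = ttnn.reduce_scatter(
        x, dim=3, math_op=ttnn.ReduceType.Sum,
        cluster_axis=cluster_axis, mesh_device=mesh_device,
    )
    return x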
JSON Configuration Schema
Base Schema
{
"moe_block": {
"name": "model_name",
"architecture_type": "sparse_moe", // or "soft_moe", "residual_moe"
"tensor_parallel": {
"enabled": true,
"cluster_axis": 1 // Which mesh dimension for TP (all-gather/reduce-scatter)
},
"router": {
"type": "topk", // or "moe_gate", "switch", "soft_router"
"config": {
"num_experts": 8,
"num_experts_per_tok": 2,
"hidden_size": 4096,
// Router-specific parameters
}
},
"experts": {
"distributed": {
"enabled": true,
"dispatch_cluster_axis": 0, // Which mesh dimension for EP (all-to-all)
"intermediate_size": 14336,
"hidden_size": 4096,
"num_experts": 8,
"activation": "swiglu" // or "gelu", "geglu", "silu"
},
"shared": {
"enabled": false,
"intermediate_size": 1408,
"parallel_with_moe": true
},
"dense_backbone": { // For residual MoE architectures
"enabled": false,
"intermediate_size": 8192,
"residual_weight": 0.5
}
}
}
}
Configuration Examples
DeepSeek-V3 Configuration
{
"moe_block": {
"name": "deepseek_v3_moe",
"architecture_type": "sparse_moe",
"tensor_parallel": {
"enabled": true,
"cluster_axis": 1 // TP on columns of mesh
},
"router": {
"type": "moe_gate",
"config": {
"num_experts": 160,
"num_experts_per_tok": 6,
"n_routed_experts": 160,
"n_group": 8,
"topk_group": 3,
"score_correction_bias": true,
"routed_scaling_factor": 1.0,
"memory_config": "L1_MEMORY_CONFIG"
}
},
"experts": {
"distributed": {
"enabled": true,
"dispatch_cluster_axis": 0, // EP on rows of mesh
"memory_config": "L1_MEMORY_CONFIG",
"intermediate_size": 10944,
"hidden_size": 5120,
"num_experts_per_device": 20,
"activation": "swiglu"
},
"shared": {
"enabled": true,
"intermediate_size": 1408,
"hidden_size": 5120,
"parallel_with_moe": true,
"memory_config": "L1_MEMORY_CONFIG"
},
"dense_backbone": {
"enabled": false
}
}
}
}
GPT-OSS Configuration
{
"moe_block": {
"name": "gpt_oss_moe",
"architecture_type": "sparse_moe",
"tensor_parallel": {
"enabled": false // GPT-OSS doesn't use TP at MoE level
},
"router": {
"type": "topk",
"config": {
"num_experts": 128,
"num_experts_per_tok": 8,
"hidden_size": 6144,
"memory_config": "DRAM_MEMORY_CONFIG"
}
},
"experts": {
"distributed": {
"enabled": true,
"dispatch_cluster_axis": 0, // EP dispatch axis
"memory_config": "L1_MEMORY_CONFIG",
"intermediate_size": 20480,
"hidden_size": 6144,
"num_experts": 128,
"weight_dtype": "bfloat4_b",
"activation": "swiglu",
"swiglu_limit": 32768
},
"shared": {
"enabled": false
},
"dense_backbone": {
"enabled": false
}
}
}
}
Mixtral Configuration
{
"moe_block": {
"name": "mixtral_8x7b",
"architecture_type": "sparse_moe",
"tensor_parallel": {
"enabled": false
},
"router": {
"type": "topk",
"config": {
"num_experts": 8,
"num_experts_per_tok": 2,
"hidden_size": 4096,
"memory_config": "DRAM_MEMORY_CONFIG"
}
},
"experts": {
"distributed": {
"enabled": true,
"dispatch_cluster_axis": 0,
"intermediate_size": 14336,
"hidden_size": 4096,
"num_experts": 8,
"activation": "swiglu"
},
"shared": {
"enabled": false
},
"dense_backbone": {
"enabled": false
}
}
}
}
Kimi K2 Configuration (Hypothetical)
{
"moe_block": {
"name": "kimi_k2",
"architecture_type": "sparse_moe",
"tensor_parallel": {
"enabled": true,
"cluster_axis": 1
},
"router": {
"type": "moe_gate",
"config": {
"num_experts": 256,
"num_experts_per_tok": 6,
"score_correction_bias": true,
"routed_scaling_factor": 1.0,
"load_balancing": {
"enabled": true,
"auxiliary_loss_weight": 0.01,
"z_loss_weight": 0.001
}
}
},
"experts": {
"distributed": {
"enabled": true,
"dispatch_cluster_axis": 0,
"num_experts_per_device": 32,
"activation": "swiglu"
},
"shared": {
"enabled": true,
"parallel_with_moe": true
}
}
}
}
Switch Transformer Configuration
{
"moe_block": {
"name": "switch_transformer",
"architecture_type": "sparse_moe",
"tensor_parallel": {
"enabled": false
},
"router": {
"type": "switch",
"config": {
"num_experts": 2048,
"routing_strategy": {
"type": "top_1",
"capacity_factor": 1.25,
"drop_tokens": true
}
}
},
"experts": {
"distributed": {
"enabled": true,
"dispatch_cluster_axis": 0,
"capacity_constrained": true
}
}
}
}
Arctic Configuration (Residual MoE)
{
"moe_block": {
"name": "arctic",
"architecture_type": "residual_moe",
"tensor_parallel": {
"enabled": false
},
"router": {
"type": "topk",
"config": {
"num_experts": 128,
"num_experts_per_tok": 2
}
},
"experts": {
"distributed": {
"enabled": true,
"dispatch_cluster_axis": 0,
"intermediate_size": 4864,
"residual_connection": true
},
"dense_backbone": {
"enabled": true,
"intermediate_size": 8192,
"residual_weight": 0.5,
"parallel_with_moe": true
}
}
}
}
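To tie the examples together, a sketch of how any of the configs above could drive component selection at load time. Note that strict JSON has no comment syntax, so the // annotations in the examples are documentation only and would need to be stripped (or omitted from the stored files) before parsing. build_router, build_experts, and MoEBlock are the same hypothetical names used in the earlier sketches.

import json

def load_moe_config(path: str) -> dict:
    with open(path) as f:
        cfg = json.load(f)["moe_block"]
    # Minimal validation of the fields every architecture needs.
    assert cfg["router"]["type"] in {"topk", "moe_gate", "switch", "soft_router"}
    assert "distributed" in cfg["experts"]
    return cfg

def build_moe_block(path: str, mesh_device):
    cfg = load_moe_config(path)
    router = build_router(cfg["router"])                  # registry sketch above
    experts = build_experts(cfg["experts"], mesh_device)  # distributed / shared / dense_backbone
    tp_cfg = cfg.get("tensor_parallel", {"enabled": False})
    return MoEBlock(router, experts, tp_cfg, mesh_device)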