
[Feature] tt-moe proposal (updated scope for MoE only) #37456

@ntarafdar

Description


Problem Statement

Currently, DeepSeek-V3 and GPT-OSS maintain separate MoE implementations with significant architectural differences:

DeepSeek-V3:

  • MoEGate router with score correction bias and expert scaling
  • SharedExpert runs in parallel with MoE experts
  • All-gather/reduce-scatter for tensor parallelism at the decoder block level
  • All-to-all dispatch/combine operations embedded within MoE module

GPT-OSS:

  • TopKRouter with simple linear + topk + softmax logic
  • Two expert modes: standard (sparse matmul) and throughput (all-to-all)
  • No shared expert component
  • Different collective operation patterns

This separation causes:

  • Code duplication for similar MoE functionality
  • No easy way to test different configurations
  • Maintenance burden for two separate implementations
  • Difficulty comparing architectures side-by-side
  • No clear path to support other MoE models (Mixtral, Kimi K2, Switch Transformer, etc.)

Proposed Solution

Create a single MoEBlock module, configured via JSON, that supports both the DeepSeek-V3 and GPT-OSS architectures and is extensible to other MoE models. This is a focused scope reduction from the full tt-moe library proposed in #37354.
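
For illustration, the intended usage could look like the sketch below. The MoEBlock constructor arguments, the mesh_device handle, and the call signature are assumptions, not a finalized API. Note that the // comments in the JSON examples in this proposal are annotations for the reader; on-disk config files would need to be plain JSON (or loaded with a comment-tolerant parser).

import json

# Hypothetical import from the proposed layout below; nothing here exists yet.
from models.demos.unified_moe.moe_block import MoEBlock

with open("models/demos/unified_moe/configs/deepseek_v3.json") as f:
    cfg = json.load(f)["moe_block"]

# mesh_device and hidden_states are provided by the surrounding model code.
moe = MoEBlock(cfg, mesh_device=mesh_device)
hidden_states = moe(hidden_states)  # same call path regardless of which config was loaded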

Architecture

models/demos/unified_moe/
├── moe_block.py           # Main unified MoE block
├── configs/
│   ├── deepseek_v3.json   # DeepSeek-V3 configuration
│   ├── gpt_oss.json       # GPT-OSS configuration
│   ├── mixtral.json       # Mixtral 8x7B configuration
│   ├── switch_transformer.json  # Switch Transformer config
│   ├── arctic.json        # Arctic residual MoE config
│   └── kimi_k2.json       # Hypothetical Kimi K2 config
├── components/
│   ├── routers/
│   │   ├── base_router.py
│   │   ├── topk_router.py      # GPT-OSS style
│   │   ├── moe_gate.py         # DeepSeek style with corrections
│   │   ├── switch_router.py    # Top-1 with capacity
│   │   └── soft_router.py      # Soft routing (all experts)
│   └── experts/
│       ├── base_expert.py
│       ├── distributed_expert.py  # With all-to-all
│       ├── standard_expert.py     # Sparse matmul
│       ├── shared_expert.py       # DeepSeek shared expert
│       ├── residual_expert.py     # Arctic-style with dense backbone
│       └── geglu_expert.py        # GeGLU activation variant
└── tests/
    ├── test_moe_block.py
    ├── test_routers.py
    ├── test_experts.py
    └── test_configs.py
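
The routers/ and experts/ split implies a small shared interface that each variant implements. A minimal sketch of what base_router.py and base_expert.py could define follows; the method names and signatures are assumptions.

from abc import ABC, abstractmethod


class BaseRouter(ABC):
    """Maps hidden states to per-token expert indices and routing weights."""

    def __init__(self, config: dict):
        self.config = config

    @abstractmethod
    def __call__(self, hidden_states):
        """Return (expert_indices, expert_weights) for each token."""


class BaseExpert(ABC):
    """Applies the expert computation to tokens selected by a router."""

    def __init__(self, config: dict):
        self.config = config

    @abstractmethod
    def __call__(self, hidden_states, expert_indices, expert_weights):
        """Return the weighted, combined expert output."""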

Key Design Principles

  1. Simplified Tensor Parallelism: TP always follows the pattern all-gather → compute → reduce-scatter; only the cluster axis needs configuration (see the sketch after this list).

  2. Direct TTNN Calls: All collective operations call ttnn directly, no abstraction layers.

  3. JSON-Driven Configuration: All architectural choices controlled via JSON files.

  4. Extensible Design: Easy to add new router types, expert architectures, and models.
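
A minimal sketch of principles 1 and 2 together, assuming a 2D mesh: the block wraps its local compute in ttnn.all_gather and ttnn.reduce_scatter on the configured cluster axis. Exact collective signatures vary between ttnn versions, so the keyword arguments below are illustrative placeholders.

import ttnn


def tp_wrapped(x, compute_fn, tp_config, mesh_device):
    # No TP configured: run the local compute directly (e.g. the GPT-OSS config).
    if not tp_config.get("enabled", False):
        return compute_fn(x)

    axis = tp_config["cluster_axis"]

    # all-gather activations along the TP axis of the mesh
    x = ttnn.all_gather(x, dim=3, cluster_axis=axis, mesh_device=mesh_device)

    # local compute (router + experts) on the gathered activations
    y = compute_fn(x)

    # reduce-scatter partial results back along the same axis
    return ttnn.reduce_scatter(y, dim=3, cluster_axis=axis, mesh_device=mesh_device)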

JSON Configuration Schema

Base Schema

{
  "moe_block": {
    "name": "model_name",
    "architecture_type": "sparse_moe",  // or "soft_moe", "residual_moe"
    
    "tensor_parallel": {
      "enabled": true,
      "cluster_axis": 1  // Which mesh dimension for TP (all-gather/reduce-scatter)
    },
    
    "router": {
      "type": "topk",  // or "moe_gate", "switch", "soft_router"
      "config": {
        "num_experts": 8,
        "num_experts_per_tok": 2,
        "hidden_size": 4096,
        // Router-specific parameters
      }
    },
    
    "experts": {
      "distributed": {
        "enabled": true,
        "dispatch_cluster_axis": 0,  // Which mesh dimension for EP (all-to-all)
        "intermediate_size": 14336,
        "hidden_size": 4096,
        "num_experts": 8,
        "activation": "swiglu"  // or "gelu", "geglu", "silu"
      },
      "shared": {
        "enabled": false,
        "intermediate_size": 1408,
        "parallel_with_moe": true
      },
      "dense_backbone": {  // For residual MoE architectures
        "enabled": false,
        "intermediate_size": 8192,
        "residual_weight": 0.5
      }
    }
  }
}
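
A small loader sketch for this schema, showing the kind of up-front validation the block could do; the function name and the specific checks are assumptions.

import json

SUPPORTED_ROUTERS = {"topk", "moe_gate", "switch", "soft_router"}


def load_moe_config(path: str) -> dict:
    """Load and minimally validate a moe_block JSON config (sketch)."""
    with open(path) as f:
        cfg = json.load(f)["moe_block"]

    if cfg["router"]["type"] not in SUPPORTED_ROUTERS:
        raise ValueError(f"unknown router type: {cfg['router']['type']}")

    tp = cfg.get("tensor_parallel", {})
    if tp.get("enabled") and "cluster_axis" not in tp:
        raise ValueError("tensor_parallel.enabled requires cluster_axis")

    return cfg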

Configuration Examples

DeepSeek-V3 Configuration

{
  "moe_block": {
    "name": "deepseek_v3_moe",
    "architecture_type": "sparse_moe",
    
    "tensor_parallel": {
      "enabled": true,
      "cluster_axis": 1  // TP on columns of mesh
    },
    
    "router": {
      "type": "moe_gate",
      "config": {
        "num_experts": 160,
        "num_experts_per_tok": 6,
        "n_routed_experts": 160,
        "n_group": 8,
        "topk_group": 3,
        "score_correction_bias": true,
        "routed_scaling_factor": 1.0,
        "memory_config": "L1_MEMORY_CONFIG"
      }
    },
    
    "experts": {
      "distributed": {
        "enabled": true,
        "dispatch_cluster_axis": 0,  // EP on rows of mesh
        "memory_config": "L1_MEMORY_CONFIG",
        "intermediate_size": 10944,
        "hidden_size": 5120,
        "num_experts_per_device": 20,
        "activation": "swiglu"
      },
      "shared": {
        "enabled": true,
        "intermediate_size": 1408,
        "hidden_size": 5120,
        "parallel_with_moe": true,
        "memory_config": "L1_MEMORY_CONFIG"
      },
      "dense_backbone": {
        "enabled": false
      }
    }
  }
}
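
Here parallel_with_moe is intended to mean that the shared expert processes every token alongside the routed experts and its output is summed with theirs, matching the existing DeepSeek-V3 behavior. A sketch with hypothetical function names:

def sparse_moe_forward(x, router, routed_experts, shared_expert, shared_cfg):
    # Route each token to its top-k experts and combine their outputs.
    indices, weights = router(x)
    out = routed_experts(x, indices, weights)

    # Shared expert sees all tokens, independent of routing, and is summed in.
    if shared_cfg.get("enabled") and shared_cfg.get("parallel_with_moe"):
        out = out + shared_expert(x)

    return out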

GPT-OSS Configuration

{
  "moe_block": {
    "name": "gpt_oss_moe",
    "architecture_type": "sparse_moe",
    
    "tensor_parallel": {
      "enabled": false  // GPT-OSS doesn't use TP at MoE level
    },
    
    "router": {
      "type": "topk",
      "config": {
        "num_experts": 128,
        "num_experts_per_tok": 8,
        "hidden_size": 6144,
        "memory_config": "DRAM_MEMORY_CONFIG"
      }
    },
    
    "experts": {
      "distributed": {
        "enabled": true,
        "dispatch_cluster_axis": 0,  // EP dispatch axis
        "memory_config": "L1_MEMORY_CONFIG",
        "intermediate_size": 20480,
        "hidden_size": 6144,
        "num_experts": 128,
        "weight_dtype": "bfloat4_b",
        "activation": "swiglu",
        "swiglu_limit": 32768
      },
      "shared": {
        "enabled": false
      },
      "dense_backbone": {
        "enabled": false
      }
    }
  }
}

Mixtral Configuration

{
  "moe_block": {
    "name": "mixtral_8x7b",
    "architecture_type": "sparse_moe",
    
    "tensor_parallel": {
      "enabled": false
    },
    
    "router": {
      "type": "topk",
      "config": {
        "num_experts": 8,
        "num_experts_per_tok": 2,
        "hidden_size": 4096,
        "memory_config": "DRAM_MEMORY_CONFIG"
      }
    },
    
    "experts": {
      "distributed": {
        "enabled": true,
        "dispatch_cluster_axis": 0,
        "intermediate_size": 14336,
        "hidden_size": 4096,
        "num_experts": 8,
        "activation": "swiglu"
      },
      "shared": {
        "enabled": false
      },
      "dense_backbone": {
        "enabled": false
      }
    }
  }
}

Kimi K2 Configuration (Hypothetical)

{
  "moe_block": {
    "name": "kimi_k2",
    "architecture_type": "sparse_moe",
    
    "tensor_parallel": {
      "enabled": true,
      "cluster_axis": 1
    },
    
    "router": {
      "type": "moe_gate",
      "config": {
        "num_experts": 256,
        "num_experts_per_tok": 6,
        "score_correction_bias": true,
        "routed_scaling_factor": 1.0,
        "load_balancing": {
          "enabled": true,
          "auxiliary_loss_weight": 0.01,
          "z_loss_weight": 0.001
        }
      }
    },
    
    "experts": {
      "distributed": {
        "enabled": true,
        "dispatch_cluster_axis": 0,
        "num_experts_per_device": 32,
        "activation": "swiglu"
      },
      "shared": {
        "enabled": true,
        "parallel_with_moe": true
      }
    }
  }
}

Switch Transformer Configuration

{
  "moe_block": {
    "name": "switch_transformer",
    "architecture_type": "sparse_moe",
    
    "tensor_parallel": {
      "enabled": false
    },
    
    "router": {
      "type": "switch",
      "config": {
        "num_experts": 2048,
        "routing_strategy": {
          "type": "top_1",
          "capacity_factor": 1.25,
          "drop_tokens": true
        }
      }
    },
    
    "experts": {
      "distributed": {
        "enabled": true,
        "dispatch_cluster_axis": 0,
        "capacity_constrained": true
      }
    }
  }
}
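
capacity_constrained follows the usual Switch Transformer definition: each expert accepts at most ceil(capacity_factor * num_tokens / num_experts) tokens per batch, and with drop_tokens enabled the overflow is dropped. A small worked sketch:

import math


def expert_capacity(num_tokens: int, num_experts: int, capacity_factor: float) -> int:
    """Maximum tokens an expert accepts before drop_tokens kicks in."""
    return math.ceil(capacity_factor * num_tokens / num_experts)


# e.g. a 4096-token batch over 2048 experts with capacity_factor 1.25
# -> at most 3 tokens per expert; anything beyond that is dropped.
print(expert_capacity(4096, 2048, 1.25))  # 3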

Arctic Configuration (Residual MoE)

{
  "moe_block": {
    "name": "arctic",
    "architecture_type": "residual_moe",
    
    "tensor_parallel": {
      "enabled": false
    },
    
    "router": {
      "type": "topk",
      "config": {
        "num_experts": 128,
        "num_experts_per_tok": 2
      }
    },
    
    "experts": {
      "distributed": {
        "enabled": true,
        "dispatch_cluster_axis": 0,
        "intermediate_size": 4864,
        "residual_connection": true
      },
      "dense_backbone": {
        "enabled": true,
        "intermediate_size": 8192,
        "residual_weight": 0.5,
        "parallel_with_moe": true
      }
    }
  }
}
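
One plausible reading of the residual-MoE fields (a sketch, not Arctic's confirmed formulation): the dense backbone and the routed experts run in parallel, and residual_weight scales the dense path before the two are summed. Whether the weight applies to the dense path, the MoE path, or both should be pinned down in the schema.

def residual_moe_forward(x, dense_backbone, moe_experts, residual_weight=0.5):
    dense_out = dense_backbone(x)   # always-on dense FFN path
    moe_out = moe_experts(x)        # sparsely routed expert path
    # Assumed combination; the exact weighting is an open config detail.
    return residual_weight * dense_out + moe_out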
