
[Feature] tt-moe proposal (updated scope for MoE only) #37456

@ntarafdar

Description


Problem Statement

Currently, DeepSeek-V3 and GPT-OSS maintain separate MoE implementations with significant architectural differences:

DeepSeek-V3:

  • MoEGate router with score correction bias and expert scaling
  • SharedExpert runs in parallel with MoE experts
  • All-gather/reduce-scatter for tensor parallelism at the decoder block level
  • All-to-all dispatch/combine operations embedded within MoE module

GPT-OSS:

  • TopKRouter with simple linear + topk + softmax logic
  • Two expert modes: standard (sparse matmul) and throughput (all-to-all)
  • No shared expert component
  • Different collective operation patterns

This separation causes:

  • Code duplication for similar MoE functionality
  • No easy way to test different configurations
  • Maintenance burden for two separate implementations
  • Difficulty comparing architectures side-by-side
  • No clear path to support other MoE models (Mixtral, Kimi K2, Switch Transformer, etc.)

Proposed Solution

Create a single MoEBlock module, configured via JSON, that supports both the DeepSeek-V3 and GPT-OSS architectures and is extensible to other MoE models. This is a focused scope reduction from the full tt-moe library proposed in #37354.
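
For illustration, the intended usage could look like the sketch below. The MoEBlock constructor arguments, the mesh_device handle, and the call signature are assumptions, not a finalized API. Note that the // comments in the JSON examples in this proposal are annotations for the reader; on-disk config files would need to be plain JSON (or loaded with a comment-tolerant parser).

import json

# Hypothetical import from the proposed layout below; nothing here exists yet.
from models.demos.unified_moe.moe_block import MoEBlock

with open("models/demos/unified_moe/configs/deepseek_v3.json") as f:
    cfg = json.load(f)["moe_block"]

# mesh_device and hidden_states are provided by the surrounding model code.
moe = MoEBlock(cfg, mesh_device=mesh_device)
hidden_states = moe(hidden_states)  # same call path regardless of which config was loaded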

Architecture

models/demos/unified_moe/
├── moe_block.py           # Main unified MoE block
├── configs/
│   ├── deepseek_v3.json   # DeepSeek-V3 configuration
│   ├── gpt_oss.json       # GPT-OSS configuration
│   ├── mixtral.json       # Mixtral 8x7B configuration
│   ├── switch_transformer.json  # Switch Transformer config
│   ├── arctic.json        # Arctic residual MoE config
│   └── kimi_k2.json       # Hypothetical Kimi K2 config
├── components/
│   ├── routers/
│   │   ├── base_router.py
│   │   ├── topk_router.py      # GPT-OSS style
│   │   ├── moe_gate.py         # DeepSeek style with corrections
│   │   ├── switch_router.py    # Top-1 with capacity
│   │   └── soft_router.py      # Soft routing (all experts)
│   └── experts/
│       ├── base_expert.py
│       ├── distributed_expert.py  # With all-to-all
│       ├── standard_expert.py     # Sparse matmul
│       ├── shared_expert.py       # DeepSeek shared expert
│       ├── residual_expert.py     # Arctic-style with dense backbone
│       └── geglu_expert.py        # GeGLU activation variant
└── tests/
    ├── test_moe_block.py
    ├── test_routers.py
    ├── test_experts.py
    └── test_configs.py
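
The routers/ and experts/ split implies a small shared interface that each variant implements. A minimal sketch of what base_router.py and base_expert.py could define follows; the method names and signatures are assumptions.

from abc import ABC, abstractmethod


class BaseRouter(ABC):
    """Maps hidden states to per-token expert indices and routing weights."""

    def __init__(self, config: dict):
        self.config = config

    @abstractmethod
    def __call__(self, hidden_states):
        """Return (expert_indices, expert_weights) for each token."""


class BaseExpert(ABC):
    """Applies the expert computation to tokens selected by a router."""

    def __init__(self, config: dict):
        self.config = config

    @abstractmethod
    def __call__(self, hidden_states, expert_indices, expert_weights):
        """Return the weighted, combined expert output."""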

Key Design Principles

  1. Simplified Tensor Parallelism: TP always follows the pattern all-gather → compute → reduce-scatter; only the cluster axis needs configuration (see the sketch after this list).

  2. Direct TTNN Calls: All collective operations call ttnn directly, no abstraction layers.

  3. JSON-Driven Configuration: All architectural choices controlled via JSON files.

  4. Extensible Design: Easy to add new router types, expert architectures, and models.
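
A minimal sketch of principles 1 and 2 together, assuming a 2D mesh: the block wraps its local compute in ttnn.all_gather and ttnn.reduce_scatter on the configured cluster axis. Exact collective signatures vary between ttnn versions, so the keyword arguments below are illustrative placeholders.

import ttnn


def tp_wrapped(x, compute_fn, tp_config, mesh_device):
    # No TP configured: run the local compute directly (e.g. the GPT-OSS config).
    if not tp_config.get("enabled", False):
        return compute_fn(x)

    axis = tp_config["cluster_axis"]

    # all-gather activations along the TP axis of the mesh
    x = ttnn.all_gather(x, dim=3, cluster_axis=axis, mesh_device=mesh_device)

    # local compute (router + experts) on the gathered activations
    y = compute_fn(x)

    # reduce-scatter partial results back along the same axis
    return ttnn.reduce_scatter(y, dim=3, cluster_axis=axis, mesh_device=mesh_device)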

JSON Configuration Schema

Base Schema

{
  "moe_block": {
    "name": "model_name",
    "architecture_type": "sparse_moe",  // or "soft_moe", "residual_moe"
    
    "tensor_parallel": {
      "enabled": true,
      "cluster_axis": 1  // Which mesh dimension for TP (all-gather/reduce-scatter)
    },
    
    "router": {
      "type": "topk",  // or "moe_gate", "switch", "soft_router"
      "config": {
        "num_experts": 8,
        "num_experts_per_tok": 2,
        "hidden_size": 4096,
        // Router-specific parameters
      }
    },
    
    "experts": {
      "distributed": {
        "enabled": true,
        "dispatch_cluster_axis": 0,  // Which mesh dimension for EP (all-to-all)
        "intermediate_size": 14336,
        "hidden_size": 4096,
        "num_experts": 8,
        "activation": "swiglu"  // or "gelu", "geglu", "silu"
      },
      "shared": {
        "enabled": false,
        "intermediate_size": 1408,
        "parallel_with_moe": true
      },
      "dense_backbone": {  // For residual MoE architectures
        "enabled": false,
        "intermediate_size": 8192,
        "residual_weight": 0.5
      }
    }
  }
}
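
A small loader sketch for this schema, showing the kind of up-front validation the block could do; the function name and the specific checks are assumptions.

import json

SUPPORTED_ROUTERS = {"topk", "moe_gate", "switch", "soft_router"}


def load_moe_config(path: str) -> dict:
    """Load and minimally validate a moe_block JSON config (sketch)."""
    with open(path) as f:
        cfg = json.load(f)["moe_block"]

    if cfg["router"]["type"] not in SUPPORTED_ROUTERS:
        raise ValueError(f"unknown router type: {cfg['router']['type']}")

    tp = cfg.get("tensor_parallel", {})
    if tp.get("enabled") and "cluster_axis" not in tp:
        raise ValueError("tensor_parallel.enabled requires cluster_axis")

    return cfg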

Configuration Examples

DeepSeek-V3 Configuration

{
  "moe_block": {
    "name": "deepseek_v3_moe",
    "architecture_type": "sparse_moe",
    
    "tensor_parallel": {
      "enabled": true,
      "cluster_axis": 1  // TP on columns of mesh
    },
    
    "router": {
      "type": "moe_gate",
      "config": {
        "num_experts": 160,
        "num_experts_per_tok": 6,
        "n_routed_experts": 160,
        "n_group": 8,
        "topk_group": 3,
        "score_correction_bias": true,
        "routed_scaling_factor": 1.0,
        "memory_config": "L1_MEMORY_CONFIG"
      }
    },
    
    "experts": {
      "distributed": {
        "enabled": true,
        "dispatch_cluster_axis": 0,  // EP on rows of mesh
        "memory_config": "L1_MEMORY_CONFIG",
        "intermediate_size": 10944,
        "hidden_size": 5120,
        "num_experts_per_device": 20,
        "activation": "swiglu"
      },
      "shared": {
        "enabled": true,
        "intermediate_size": 1408,
        "hidden_size": 5120,
        "parallel_with_moe": true,
        "memory_config": "L1_MEMORY_CONFIG"
      },
      "dense_backbone": {
        "enabled": false
      }
    }
  }
}
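
Here parallel_with_moe is intended to mean that the shared expert processes every token alongside the routed experts and its output is summed with theirs, matching the existing DeepSeek-V3 behavior. A sketch with hypothetical function names:

def sparse_moe_forward(x, router, routed_experts, shared_expert, shared_cfg):
    # Route each token to its top-k experts and combine their outputs.
    indices, weights = router(x)
    out = routed_experts(x, indices, weights)

    # Shared expert sees all tokens, independent of routing, and is summed in.
    if shared_cfg.get("enabled") and shared_cfg.get("parallel_with_moe"):
        out = out + shared_expert(x)

    return out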

GPT-OSS Configuration

{
  "moe_block": {
    "name": "gpt_oss_moe",
    "architecture_type": "sparse_moe",
    
    "tensor_parallel": {
      "enabled": false  // GPT-OSS doesn't use TP at MoE level
    },
    
    "router": {
      "type": "topk",
      "config": {
        "num_experts": 128,
        "num_experts_per_tok": 8,
        "hidden_size": 6144,
        "memory_config": "DRAM_MEMORY_CONFIG"
      }
    },
    
    "experts": {
      "distributed": {
        "enabled": true,
        "dispatch_cluster_axis": 0,  // EP dispatch axis
        "memory_config": "L1_MEMORY_CONFIG",
        "intermediate_size": 20480,
        "hidden_size": 6144,
        "num_experts": 128,
        "weight_dtype": "bfloat4_b",
        "activation": "swiglu",
        "swiglu_limit": 32768
      },
      "shared": {
        "enabled": false
      },
      "dense_backbone": {
        "enabled": false
      }
    }
  }
}

Mixtral Configuration

{
  "moe_block": {
    "name": "mixtral_8x7b",
    "architecture_type": "sparse_moe",
    
    "tensor_parallel": {
      "enabled": false
    },
    
    "router": {
      "type": "topk",
      "config": {
        "num_experts": 8,
        "num_experts_per_tok": 2,
        "hidden_size": 4096,
        "memory_config": "DRAM_MEMORY_CONFIG"
      }
    },
    
    "experts": {
      "distributed": {
        "enabled": true,
        "dispatch_cluster_axis": 0,
        "intermediate_size": 14336,
        "hidden_size": 4096,
        "num_experts": 8,
        "activation": "swiglu"
      },
      "shared": {
        "enabled": false
      },
      "dense_backbone": {
        "enabled": false
      }
    }
  }
}

Kimi K2 Configuration (Hypothetical)

{
  "moe_block": {
    "name": "kimi_k2",
    "architecture_type": "sparse_moe",
    
    "tensor_parallel": {
      "enabled": true,
      "cluster_axis": 1
    },
    
    "router": {
      "type": "moe_gate",
      "config": {
        "num_experts": 256,
        "num_experts_per_tok": 6,
        "score_correction_bias": true,
        "routed_scaling_factor": 1.0,
        "load_balancing": {
          "enabled": true,
          "auxiliary_loss_weight": 0.01,
          "z_loss_weight": 0.001
        }
      }
    },
    
    "experts": {
      "distributed": {
        "enabled": true,
        "dispatch_cluster_axis": 0,
        "num_experts_per_device": 32,
        "activation": "swiglu"
      },
      "shared": {
        "enabled": true,
        "parallel_with_moe": true
      }
    }
  }
}

Switch Transformer Configuration

{
  "moe_block": {
    "name": "switch_transformer",
    "architecture_type": "sparse_moe",
    
    "tensor_parallel": {
      "enabled": false
    },
    
    "router": {
      "type": "switch",
      "config": {
        "num_experts": 2048,
        "routing_strategy": {
          "type": "top_1",
          "capacity_factor": 1.25,
          "drop_tokens": true
        }
      }
    },
    
    "experts": {
      "distributed": {
        "enabled": true,
        "dispatch_cluster_axis": 0,
        "capacity_constrained": true
      }
    }
  }
}
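
capacity_constrained follows the usual Switch Transformer definition: each expert accepts at most ceil(capacity_factor * num_tokens / num_experts) tokens per batch, and with drop_tokens enabled the overflow is dropped. A small worked sketch:

import math


def expert_capacity(num_tokens: int, num_experts: int, capacity_factor: float) -> int:
    """Maximum tokens an expert accepts before drop_tokens kicks in."""
    return math.ceil(capacity_factor * num_tokens / num_experts)


# e.g. a 4096-token batch over 2048 experts with capacity_factor 1.25
# -> at most 3 tokens per expert; anything beyond that is dropped.
print(expert_capacity(4096, 2048, 1.25))  # 3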

Arctic Configuration (Residual MoE)

{
  "moe_block": {
    "name": "arctic",
    "architecture_type": "residual_moe",
    
    "tensor_parallel": {
      "enabled": false
    },
    
    "router": {
      "type": "topk",
      "config": {
        "num_experts": 128,
        "num_experts_per_tok": 2
      }
    },
    
    "experts": {
      "distributed": {
        "enabled": true,
        "dispatch_cluster_axis": 0,
        "intermediate_size": 4864,
        "residual_connection": true
      },
      "dense_backbone": {
        "enabled": true,
        "intermediate_size": 8192,
        "residual_weight": 0.5,
        "parallel_with_moe": true
      }
    }
  }
}
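
One plausible reading of the residual-MoE fields (a sketch, not Arctic's confirmed formulation): the dense backbone and the routed experts run in parallel, and residual_weight scales the dense path before the two are summed. Whether the weight applies to the dense path, the MoE path, or both should be pinned down in the schema.

def residual_moe_forward(x, dense_backbone, moe_experts, residual_weight=0.5):
    dense_out = dense_backbone(x)   # always-on dense FFN path
    moe_out = moe_experts(x)        # sparsely routed expert path
    # Assumed combination; the exact weighting is an open config detail.
    return residual_weight * dense_out + moe_out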
