Research project analyzing cross-layer routing patterns in Mixture-of-Experts (MoE) models for speculative pipeline parallelism.
This project collects and analyzes routing decisions from Mixtral-8x7B to discover predictable patterns that can be exploited for speculative pipeline parallelism optimization.
Key Result: We achieve 55.6% Top-1 and 73.2% Top-2 accuracy for expert prediction. The Top-1 figure is a 4.5× improvement (+345%) over the 12.5% random baseline.
Mixtral-8x7B-Instruct-v0.1-FP8 (8-bit quantized)
- 8 experts per layer
- Top-2 routing (selects 2 experts per token)
- 32 MoE transformer layers
We use FP8 quantization to fit Mixtral on a single GPU. This is valid for routing analysis because:
The gating/router network is NOT quantized — it always runs at FP16/BF16 precision.
In MoE models:
- The router computes gating logits to decide which experts to use
- The experts (FFN weights) perform the actual computation
- Quantization (FP8/AWQ/GPTQ) only applies to expert weights, not the router
This means routing patterns remain essentially identical to those of the full-precision model. MoE routers are highly sensitive: quantization noise in the gating logits could send tokens to the wrong experts, which is why vLLM and other frameworks explicitly keep routers unquantized.
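For intuition, here is a minimal sketch of Mixtral-style top-2 gating (our own illustration, not the project's or vLLM's code; names and shapes are assumptions). The router is just a small `nn.Linear` kept at full/half precision, while quantization would only touch the expert FFN weights:

```python
import torch
import torch.nn as nn

class Top2Router(nn.Module):
    """Illustrative Mixtral-style top-2 router (sketch, not vLLM's implementation)."""

    def __init__(self, hidden_dim: int = 4096, num_experts: int = 8):
        super().__init__()
        # The gate stays unquantized: it is tiny compared to the expert FFNs.
        self.gate = nn.Linear(hidden_dim, num_experts, bias=False)

    def forward(self, hidden_states: torch.Tensor):
        # hidden_states: [num_tokens, hidden_dim]
        logits = self.gate(hidden_states)                   # [num_tokens, 8]
        top2_logits, top2_experts = logits.topk(2, dim=-1)  # select 2 of 8 experts
        # Gating weights are renormalized over the two selected experts.
        gating_probs = torch.softmax(top2_logits, dim=-1)   # [num_tokens, 2]
        return top2_experts, gating_probs
```

These `experts` and `gating_probs` are exactly the per-token, per-layer quantities our profiler records.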
- HumanEval: 164 problems (code generation) — openai_humaneval
- GSM8K: 80 problems profiled (math reasoning) — gsm8k
Combined: 1.43M routing records across 244 problems, split 98/1/1 for train/val/test.
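For reference, a hypothetical sketch of the 98/1/1 split (the real logic lives in `experiment_and_evaluation/preprocess_data.py`; whether it splits at the record or problem level is not documented here, so this record-level version is an assumption):

```python
import json
import random

def split_98_1_1(jsonl_paths, seed=0):
    """Hypothetical record-level 98/1/1 split; see preprocess_data.py for the real one."""
    records = []
    for path in jsonl_paths:
        with open(path) as f:
            records.extend(json.loads(line) for line in f)
    random.Random(seed).shuffle(records)
    n = len(records)
    n_train, n_val = int(0.98 * n), int(0.01 * n)
    return records[:n_train], records[n_train:n_train + n_val], records[n_train + n_val:]
```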
```
speculative-moe/
├── README.md                        # This file
├── proposal.pdf                     # Project proposal
│
├── profiling_moe_code/              # Custom profiling code for data collection
│   ├── README.md                    # Setup instructions
│   ├── collect_data.py              # Main data collection script
│   ├── custom_profiling/            # vLLM routing profiler
│   └── LAYER_PATCH.txt              # vLLM layer.py patch
│
├── routing_data_collected/          # Collected routing data
│   ├── README.md                    # Data format documentation
│   ├── sample/                      # Sample datasets (2 problems each)
│   ├── humaneval_full_*.jsonl       # Full HumanEval data (164 problems)
│   └── gsm8k_full_*.jsonl           # Full GSM8K data (80 problems profiled)
│
├── experiment_and_evaluation/       # All experiments and results
│   ├── README.md                    # Experiment navigation guide
│   ├── preprocess_data.py           # Data preprocessing
│   ├── processed_data/              # Train/val/test splits
│   ├── 1_lookup_table/              # Step 1: Frequency-based rules
│   ├── 2_xgboost/                   # Step 2: Classical ML baseline
│   ├── 3_bayesian_search/           # Step 3: Hyperparameter optimization
│   ├── 4_neural_network_mlp/        # Step 4: Final optimized model
│   └── Final_Experiment_Evaluation_Summary.md  # Results summary
│
└── others/                          # Legacy scripts and notes
```
For each token at each MoE layer, we capture:
- `problem_id` — which problem from the dataset
- `layer` — which MoE layer (0-31 for Mixtral)
- `token_idx` — position in the generated sequence
- `experts` — which top-2 experts were selected
- `gating_probs` — gating probabilities for the selected experts
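A small sketch of how these records can be regrouped into per-token expert trajectories (the sample file name below is illustrative; only the fields listed above come from the data format):

```python
import json
from collections import defaultdict

def load_routing_paths(jsonl_path):
    """Group records into per-token expert paths: (problem_id, token_idx) -> {layer: experts}."""
    paths = defaultdict(dict)
    with open(jsonl_path) as f:
        for line in f:
            rec = json.loads(line)
            paths[(rec["problem_id"], rec["token_idx"])][rec["layer"]] = rec["experts"]
    return paths

# Example (illustrative path): expert trajectory of token 0 in problem 0 across layers 0-31
# paths = load_routing_paths("routing_data_collected/sample/humaneval_sample.jsonl")
# print([paths[(0, 0)].get(layer) for layer in range(32)])
```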
1. Layer-Expert Patterns
Certain experts are preferred at certain layers. We captured this in our Level 1 lookup rules (Layer → Expert frequency).
2. Cross-Layer Transitions
Expert transitions between layers are not random; they follow predictable patterns. Our Level 2-3 rules capture (Layer, Previous Expert(s), Current Expert) → Next Expert probabilities (a concrete sketch follows the results table below).
3. Predictability — The Main Finding
Yes, we can predict Layer N+1 routing from Layer N decisions. Our best model achieves:
| Model | Top-1 Accuracy | Top-2 Accuracy | vs Random |
|---|---|---|---|
| Random Baseline | 12.5% | 25.0% | — |
| Lookup Table (Best) | 42.2% | 56.2% | +237% (3.4x) |
| XGBoost | 36.2% | 53.3% | +189% (2.9x) |
| Neural Network (MLP) | 55.6% | 73.2% | +345% (4.5x) |
This means if we speculatively prepare 2 experts ahead of time, we'll be correct 73% of the time — enabling significant pipeline parallelism gains.
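To make the lookup-table idea concrete, here is a minimal sketch of a Level 2-style rule (our own simplification of what `1_lookup_table/` does, reusing the `load_routing_paths` helper sketched earlier): count which next-layer experts follow each (layer, current top-2 experts) context, then predict the most frequent ones.

```python
from collections import Counter, defaultdict

def build_transition_table(paths):
    """(layer, current top-2 experts) -> Counter of next-layer experts (sketch)."""
    table = defaultdict(Counter)
    for layers in paths.values():  # paths: (problem_id, token_idx) -> {layer: experts}
        for layer in sorted(layers):
            if layer + 1 in layers:
                key = (layer, tuple(sorted(layers[layer])))
                for nxt in layers[layer + 1]:
                    table[key][nxt] += 1
    return table

def predict_top2(table, layer, current_experts):
    """Return the two most frequent next-layer experts for this context."""
    counts = table[(layer, tuple(sorted(current_experts)))]
    return [expert for expert, _ in counts.most_common(2)]
```

The final MLP improves on these hard frequency counts (56.2% → 73.2% Top-2 accuracy in the table above).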
- Python 3.11+
- CUDA GPU with ~45GB VRAM (for FP8 Mixtral)
- vLLM with custom profiling patch
See profiling_moe_code/README.md for full setup instructions.
```bash
# Install dependencies
pip install vllm datasets

# Apply vLLM patch (see profiling_moe_code/LAYER_PATCH.txt)
# Copy profiling_moe_code/custom_profiling/ to your vLLM installation

# Run data collection
cd profiling_moe_code
python collect_data.py --dataset humaneval --num-problems 5
```

Data is saved as JSONL (one JSON object per line):
{"dataset": "humaneval", "problem_id": 0, "layer": 0, "token_idx": 0, "experts": [3, 7], "gating_probs": [0.68, 0.32]}- Mixtral of Experts — Mistral AI
- vLLM — Fast LLM inference
Research project for academic purposes.