This repo demonstrates elastic inference capabilities with the Gemma3n model, allowing dynamic adjustment of model capacity through MLP sharding and layer skipping.
- E4B: Full model with all 35 layers and maximum capacity (8/8 shards per layer)
- E2B: Reduced model with 5 layers skipped and smaller MLPs (4/8 shards per layer)
- E2.54B: Hybrid configuration with selective high-capacity layers
- E2.98B: Another hybrid configuration with different capacity distribution
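The configurations above can be described as a mapping from name to a set of skipped layers plus per-layer shard counts. The sketch below is a hypothetical encoding (the exact skipped-layer indices and the hybrid variants' per-layer splits are assumptions, not the repo's actual values):

```python
# Hypothetical configuration table following the descriptions above.
# Skipped-layer indices for E2B are illustrative assumptions.
NUM_LAYERS = 35
TOTAL_SHARDS = 8

CONFIGS = {
    # Full model: every layer active, all 8 shards per layer.
    "E4B": {"skip_layers": set(), "shards_per_layer": [8] * NUM_LAYERS},
    # Reduced model: 5 layers skipped, 4/8 shards on the rest.
    "E2B": {"skip_layers": {10, 14, 18, 22, 26},
            "shards_per_layer": [4] * NUM_LAYERS},
}

def effective_capacity(cfg):
    """Fraction of full MLP compute retained under a configuration."""
    active = [s for i, s in enumerate(cfg["shards_per_layer"])
              if i not in cfg["skip_layers"]]
    return sum(active) / (TOTAL_SHARDS * NUM_LAYERS)
```

Under this encoding, E4B retains 100% of the MLP compute, while E2B retains (35 − 5) × 4 / (35 × 8) ≈ 43%.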
- ShardedMLP: Allows dynamic adjustment of the MLP intermediate dimension by computing only 1-8 shards (column chunks) of the original matmul.
- Monkey Patching: Runtime modification of model architecture (layer skipping and variable MLP sharding) without reloading
- Streamlit UI: Interactive interface for real-time configuration changes
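The ShardedMLP idea can be sketched as follows. This is an assumed design, not the repo's exact class: the up-projection is split into 8 equal column chunks, and a runtime `num_shards` attribute controls how many chunks participate in the forward pass (the name and gated-vs-plain MLP structure are assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ShardedMLP(nn.Module):
    """MLP whose intermediate dimension can be narrowed at runtime.

    The full weights are always resident; slicing the first
    `num_shards * shard_dim` columns/rows selects the active capacity.
    """

    def __init__(self, hidden_dim: int, intermediate_dim: int, total_shards: int = 8):
        super().__init__()
        assert intermediate_dim % total_shards == 0
        self.shard_dim = intermediate_dim // total_shards
        self.up_proj = nn.Linear(hidden_dim, intermediate_dim, bias=False)
        self.down_proj = nn.Linear(intermediate_dim, hidden_dim, bias=False)
        self.num_shards = total_shards  # adjustable between 1 and total_shards

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        k = self.num_shards * self.shard_dim  # active slice of intermediate dim
        h = F.gelu(x @ self.up_proj.weight[:k].T)       # [..., k]
        return h @ self.down_proj.weight[:, :k].T       # [..., hidden_dim]
```

Because only the active slice of each weight matrix enters the matmul, reducing `num_shards` cuts MLP FLOPs proportionally without reloading or copying weights.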
```bash
pip install -r requirements.txt
streamlit run app.py
```

This launches a web interface where you can:
- Select different model configurations
- Apply changes in real-time
- Chat with the model using different capacity settings
- Compare performance across configurations
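Applying a configuration in real time comes down to the monkey patching mentioned above. A minimal sketch of the idea, assuming a ShardedMLP-style `num_shards` attribute (the `apply_config` helper and the `DummyLayer` stand-in are illustrative, not the repo's code):

```python
import torch
import torch.nn as nn

class DummyLayer(nn.Module):
    """Stand-in for a transformer decoder layer, used for demonstration."""
    def __init__(self):
        super().__init__()
        self.mlp = nn.Identity()
        self.mlp.num_shards = 8
    def forward(self, hidden):
        return hidden + 1.0  # placeholder for attention + MLP

def apply_config(model, skip_layers, num_shards):
    """Patch layers in place: skipped layers become identities,
    active layers get the requested shard count. No weights are reloaded."""
    for idx, layer in enumerate(model.layers):
        if not hasattr(layer, "_orig_forward"):
            layer._orig_forward = layer.forward  # save the original once
        if idx in skip_layers:
            layer.forward = lambda hidden, *a, **kw: hidden  # identity pass-through
        else:
            layer.forward = layer._orig_forward  # restore if previously skipped
        layer.mlp.num_shards = num_shards
```

Because the original `forward` is saved on first patch, switching back to the full E4B configuration is just another `apply_config` call with no skipped layers.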