srjoglekar246/elastic_gemma3n

Elastic Gemma3n Demo

This repo demonstrates elastic inference capabilities with the Gemma3n model, allowing dynamic adjustment of model capacity through MLP sharding and layer skipping.

Features

Model Configurations

  • E4B: Full model with all 35 layers and maximum capacity (8/8 shards per layer)
  • E2B: Reduced model with 5 layers skipped and smaller MLPs (4/8 shards per layer)
  • E2.54B: Hybrid configuration with selective high-capacity layers
  • E2.98B: Another hybrid configuration with different capacity distribution
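One way to picture these configurations is as a per-layer shard count plus a set of skipped layers. The sketch below is illustrative only: `ElasticConfig` and its fields are hypothetical names, not the repo's API, and the skipped layer indices for E2B are placeholders, since the README states only that five layers are skipped.

```python
from dataclasses import dataclass

NUM_LAYERS = 35      # from the README: the full model has 35 layers
SHARDS_PER_MLP = 8   # from the README: each MLP is split into 8 shards


@dataclass(frozen=True)
class ElasticConfig:
    """Hypothetical container: shards used per layer plus skipped layer indices."""
    name: str
    shards_per_layer: tuple
    skipped_layers: frozenset = frozenset()

    def __post_init__(self):
        if len(self.shards_per_layer) != NUM_LAYERS:
            raise ValueError("need one shard count per layer")
        if any(not 1 <= s <= SHARDS_PER_MLP for s in self.shards_per_layer):
            raise ValueError(f"shard counts must be in 1..{SHARDS_PER_MLP}")
        if any(not 0 <= i < NUM_LAYERS for i in self.skipped_layers):
            raise ValueError("skipped layer index out of range")

    @property
    def active_layers(self):
        """Indices of the layers that still run."""
        return [i for i in range(NUM_LAYERS) if i not in self.skipped_layers]

    def mlp_fraction(self):
        """Fraction of the full MLP width in use; skipped layers count as zero."""
        used = sum(self.shards_per_layer[i] for i in self.active_layers)
        return used / (NUM_LAYERS * SHARDS_PER_MLP)


# E4B: every layer active, 8/8 shards each.
E4B = ElasticConfig("E4B", (8,) * NUM_LAYERS)

# E2B: 4/8 shards per layer and 5 layers skipped.
# The indices 20..24 are placeholders chosen for illustration.
E2B = ElasticConfig("E2B", (4,) * NUM_LAYERS, frozenset(range(20, 25)))

print(E4B.name, len(E4B.active_layers), E4B.mlp_fraction())
print(E2B.name, len(E2B.active_layers), round(E2B.mlp_fraction(), 3))
```

A hybrid configuration such as E2.54B or E2.98B would, under this representation, simply use a tuple with different shard counts for different layers.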

Key Ideas

  1. ShardedMLP: Adjusts the MLP intermediate dimension at runtime by using 1 to 8 shards (chunks) of the original matmul operation.
  2. Monkey Patching: Modifies the model architecture at runtime (layer skipping and variable MLP sharding) without reloading the model.
  3. Streamlit UI: Provides an interactive interface for changing the configuration in real time.
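To make the sharding idea concrete, here is a minimal sketch of an MLP that evaluates only the first `k` of its 8 shards. It is not the repo's `ShardedMLP`: the class body, the gated-GELU layout, and the method names are assumptions, and it uses plain Python lists instead of MLX arrays so that it runs without dependencies.

```python
import math


def gelu(v):
    """Tanh approximation of GELU (the activation choice is an assumption)."""
    return 0.5 * v * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (v + 0.044715 * v ** 3)))


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


class ShardedMLP:
    """Hypothetical gated MLP whose intermediate width is cut into equal shards."""

    def __init__(self, gate, up, down, num_shards=8):
        # gate[j], up[j]: input weights of intermediate unit j (length = hidden size).
        # down[j]: output weights of intermediate unit j (length = hidden size).
        if not (len(gate) == len(up) == len(down)):
            raise ValueError("gate, up and down need one row per intermediate unit")
        if len(gate) % num_shards:
            raise ValueError("intermediate size must be divisible by num_shards")
        self.gate, self.up, self.down = gate, up, down
        self.num_shards = num_shards
        self.shard_size = len(gate) // num_shards
        self.active_shards = num_shards  # start at full capacity

    def set_active_shards(self, k):
        """Change capacity at runtime; the weights themselves are untouched."""
        if not 1 <= k <= self.num_shards:
            raise ValueError(f"active shards must be in 1..{self.num_shards}")
        self.active_shards = k

    def __call__(self, x):
        width = self.shard_size * self.active_shards  # units actually evaluated
        out = [0.0] * len(x)
        for j in range(width):
            h = gelu(dot(self.gate[j], x)) * dot(self.up[j], x)
            for i, w in enumerate(self.down[j]):
                out[i] += h * w
        return out


# Toy weights: hidden size 2, intermediate size 16 (2 units per shard).
gate = [[0.1 * (j + 1), -0.05 * j] for j in range(16)]
up = [[0.02 * j, 0.03 * (j + 1)] for j in range(16)]
down = [[0.01 * (j + 1), -0.02 * j] for j in range(16)]

mlp = ShardedMLP(gate, up, down)
x = [1.0, -2.0]
full = mlp(x)             # 8/8 shards
mlp.set_active_shards(4)
half = mlp(x)             # 4/8 shards, same weights, no reload
```

Because the active shards are always a prefix of the intermediate dimension, switching capacity only changes how much of each weight matrix is read. With array libraries the same effect would typically come from slicing the weights rather than looping over units.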

Installation

pip install -r requirements.txt

Usage

Streamlit Interface

streamlit run app.py

This launches a web interface where you can:

  • Select different model configurations
  • Apply changes in real-time
  • Chat with the model using different capacity settings
  • Compare performance across configurations

About

Demonstrate elastic inference leveraging Gemma3n's Matryoshka layers with MLX
