This repo demonstrates elastic inference capabilities with the Gemma3n model, allowing dynamic adjustment of model capacity through MLP sharding and layer skipping.
- E4B: Full model with all 35 layers and maximum capacity (8/8 shards per layer)
- E2B: Reduced model with 5 layers skipped and smaller MLPs (4/8 shards per layer)
- E2.54B: Hybrid configuration with selective high-capacity layers
- E2.98B: Another hybrid configuration with different capacity distribution
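The configurations above can be described as a mapping from name to a set of skipped layers plus per-layer shard counts. The sketch below is a hypothetical encoding (the exact skipped-layer indices and the hybrid variants' per-layer splits are assumptions, not the repo's actual values):

```python
# Hypothetical configuration table following the descriptions above.
# Skipped-layer indices for E2B are illustrative assumptions.
NUM_LAYERS = 35
TOTAL_SHARDS = 8

CONFIGS = {
    # Full model: every layer active, all 8 shards per layer.
    "E4B": {"skip_layers": set(), "shards_per_layer": [8] * NUM_LAYERS},
    # Reduced model: 5 layers skipped, 4/8 shards on the rest.
    "E2B": {"skip_layers": {10, 14, 18, 22, 26},
            "shards_per_layer": [4] * NUM_LAYERS},
}

def effective_capacity(cfg):
    """Fraction of full MLP compute retained under a configuration."""
    active = [s for i, s in enumerate(cfg["shards_per_layer"])
              if i not in cfg["skip_layers"]]
    return sum(active) / (TOTAL_SHARDS * NUM_LAYERS)
```

Under this encoding, E4B retains 100% of the MLP compute, while E2B retains (35 − 5) × 4 / (35 × 8) ≈ 43%.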
- ShardedMLP: Allows dynamic adjustment of the MLP intermediate dimension by computing only 1-8 shards (column chunks) of the original matmul.
- Monkey Patching: Runtime modification of model architecture (layer skipping and variable MLP sharding) without reloading
- Streamlit UI: Interactive interface for real-time configuration changes
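The ShardedMLP idea can be sketched as follows. This is an assumed design, not the repo's exact class: the up-projection is split into 8 equal column chunks, and a runtime `num_shards` attribute controls how many chunks participate in the forward pass (the name and gated-vs-plain MLP structure are assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ShardedMLP(nn.Module):
    """MLP whose intermediate dimension can be narrowed at runtime.

    The full weights are always resident; slicing the first
    `num_shards * shard_dim` columns/rows selects the active capacity.
    """

    def __init__(self, hidden_dim: int, intermediate_dim: int, total_shards: int = 8):
        super().__init__()
        assert intermediate_dim % total_shards == 0
        self.shard_dim = intermediate_dim // total_shards
        self.up_proj = nn.Linear(hidden_dim, intermediate_dim, bias=False)
        self.down_proj = nn.Linear(intermediate_dim, hidden_dim, bias=False)
        self.num_shards = total_shards  # adjustable between 1 and total_shards

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        k = self.num_shards * self.shard_dim  # active slice of intermediate dim
        h = F.gelu(x @ self.up_proj.weight[:k].T)       # [..., k]
        return h @ self.down_proj.weight[:, :k].T       # [..., hidden_dim]
```

Because only the active slice of each weight matrix enters the matmul, reducing `num_shards` cuts MLP FLOPs proportionally without reloading or copying weights.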
```bash
pip install -r requirements.txt
streamlit run app.py
```

This launches a web interface where you can:
- Select different model configurations
- Apply changes in real-time
- Chat with the model using different capacity settings
- Compare performance across configurations
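Applying a configuration in real time comes down to the monkey patching mentioned above. A minimal sketch of the idea, assuming a ShardedMLP-style `num_shards` attribute (the `apply_config` helper and the `DummyLayer` stand-in are illustrative, not the repo's code):

```python
import torch
import torch.nn as nn

class DummyLayer(nn.Module):
    """Stand-in for a transformer decoder layer, used for demonstration."""
    def __init__(self):
        super().__init__()
        self.mlp = nn.Identity()
        self.mlp.num_shards = 8
    def forward(self, hidden):
        return hidden + 1.0  # placeholder for attention + MLP

def apply_config(model, skip_layers, num_shards):
    """Patch layers in place: skipped layers become identities,
    active layers get the requested shard count. No weights are reloaded."""
    for idx, layer in enumerate(model.layers):
        if not hasattr(layer, "_orig_forward"):
            layer._orig_forward = layer.forward  # save the original once
        if idx in skip_layers:
            layer.forward = lambda hidden, *a, **kw: hidden  # identity pass-through
        else:
            layer.forward = layer._orig_forward  # restore if previously skipped
        layer.mlp.num_shards = num_shards
```

Because the original `forward` is saved on first patch, switching back to the full E4B configuration is just another `apply_config` call with no skipped layers.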