# Official Repository for X-EcoMLA and Zebra-Llama
Welcome! This repo hosts two complementary projects focused on memory-efficient, high-performance large language models (LLMs). LLMs often face major memory bottlenecks during inference due to large key-value (KV) caches. This repository introduces two solutions:
| Folder | Description |
|---|---|
| `x-eco-mla/` | Implements X-EcoMLA: a method for upcycling attention into Multi-head Latent Attention (MLA) for extreme KV cache compression. |
| `zebra-llama/` | Implements Zebra-Llama: a family of hybrid MLA + Mamba2 models with minimal retraining and maximum efficiency. |
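To make the KV-cache motivation above concrete, below is a back-of-the-envelope sketch (not code from this repository) comparing the per-token KV-cache footprint of standard attention against a compressed latent cache of the kind MLA maintains. All dimensions are hypothetical placeholders, not the actual X-EcoMLA or Zebra-Llama configurations.

```python
# Rough per-token KV-cache sizing: standard multi-head attention (MHA)
# vs. a single low-rank latent per layer (as in MLA).
# All values below are HYPOTHETICAL placeholders for illustration only.

BYTES_PER_VALUE = 2   # fp16/bf16
N_LAYERS = 32         # hypothetical number of transformer layers
N_KV_HEADS = 8        # hypothetical number of KV heads
HEAD_DIM = 128        # hypothetical per-head dimension
D_LATENT = 512        # hypothetical compressed latent dimension

def mha_kv_bytes_per_token() -> int:
    """Standard attention caches full K and V for every layer and KV head."""
    return 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE

def mla_kv_bytes_per_token() -> int:
    """MLA-style caching stores one low-rank latent per layer instead of
    full K/V (real MLA also keeps a small decoupled RoPE part, omitted here)."""
    return N_LAYERS * D_LATENT * BYTES_PER_VALUE

if __name__ == "__main__":
    mha = mha_kv_bytes_per_token()
    mla = mla_kv_bytes_per_token()
    print(f"MHA KV cache: {mha / 1024:.1f} KiB per token")
    print(f"MLA KV cache: {mla / 1024:.1f} KiB per token")
    print(f"Compression:  {mha / mla:.1f}x smaller")
```

Under these placeholder numbers the latent cache is roughly 4x smaller per token; the actual compression ratios reported by the projects depend on their specific configurations and are documented in the respective folders.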
If you find this repository useful in your research or application, please cite our papers:
@article{li2025x_ecomla,
  title={{X-EcoMLA}: Upcycling Pre-Trained Attention into {MLA} for Efficient and Extreme {KV} Compression},
  author={Li, Guihong and Rezagholizadeh, Mehdi and Yang, Mingyu and Appia, Vikram and Barsoum, Emad},
  journal={arXiv preprint arXiv:2503.11132},
  year={2025},
  url={https://arxiv.org/abs/2503.11132}
}

@article{yang2025zebra,
  title={Zebra-Llama: Towards Extremely Efficient Hybrid Models},
  author={Yang, Mingyu and Rezagholizadeh, Mehdi and Li, Guihong and Appia, Vikram and Barsoum, Emad},
  journal={arXiv preprint arXiv:2505.17272},
  year={2025},
  url={https://arxiv.org/abs/2505.17272}
}

We welcome contributions! Please open an issue to discuss questions and major changes.
This project is licensed under the Apache License 2.0. See the LICENSE file for details.