LoRAX (LoRA eXchange) is a framework that allows users to serve thousands of fine-tuned models on a single GPU, dramatically reducing the cost of serving without compromising on throughput or latency.
🚀 Start Here: For a Robust & Reliable LoRAX Deployment
While this README.md provides a general overview, setting up a performant LoRAX server involves specific hardware, software, and environment configurations. To ensure a smooth, "impossible-to-fail" deployment experience, we highly recommend consulting our detailed LoRAX Deployment Playbook. This guide covers:
- Bulletproof Host System Setup: NVIDIA drivers, Docker, nvidia-container-toolkit, and crucial user permissions.
- GPU VRAM Considerations: Understanding LLM memory requirements and selecting compatible models for your hardware.
- Pre-Built vs. Source Deployment: Choosing the fastest path or building from source with all CUDA kernels.
- Common Pitfalls & Troubleshooting: Solutions for Hugging Face authentication, model download stalls, and more.
- 🚅 Dynamic Adapter Loading: include any fine-tuned LoRA adapter from HuggingFace, Predibase, or any filesystem in your request; it will be loaded just-in-time without blocking concurrent requests. Merge adapters per request to instantly create powerful ensembles (see the sketch after this list).
- 🏋️‍♀️ Heterogeneous Continuous Batching: packs requests for different adapters together into the same batch, keeping latency and throughput nearly constant with the number of concurrent adapters.
- 🧁 Adapter Exchange Scheduling: asynchronously prefetches and offloads adapters between GPU and CPU memory, and schedules request batching to optimize the aggregate throughput of the system.
- 👬 Optimized Inference: high throughput and low latency optimizations including tensor parallelism, pre-compiled CUDA kernels (flash-attention, paged attention, SGMV), quantization, token streaming.
- 🚢 Ready for Production: prebuilt Docker images, Helm charts for Kubernetes, Prometheus metrics, and distributed tracing with OpenTelemetry. OpenAI-compatible API supporting multi-turn chat conversations. Private adapters through per-request tenant isolation. Structured Output (JSON mode).
- 🤯 Free for Commercial Use: Apache 2.0 License. Enough said 😎.
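Per-request adapter merging works through the same /generate endpoint shown later in this README. The sketch below is illustrative rather than canonical: the merged_adapters parameter block follows the schema described in LoRAX's adapter-merging guide, and the two adapter IDs are hypothetical placeholders; verify the field names against your LoRAX version.

```shell
# Hedged sketch: merge two hypothetical LoRA adapters for a single request.
# The merged_adapters schema (ids, weights, merge_strategy) follows the LoRAX
# adapter-merging guide; confirm it against the docs for your version.
curl 127.0.0.1:8080/generate \
    -X POST \
    -d '{
        "inputs": "[INST] What is 12 * 7? [/INST]",
        "parameters": {
            "max_new_tokens": 64,
            "merged_adapters": {
                "ids": ["my-org/math-lora", "my-org/reasoning-lora"],
                "weights": [0.5, 0.5],
                "merge_strategy": "linear"
            }
        }
    }' \
    -H 'Content-Type: application/json'
```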
Serving a fine-tuned model with LoRAX consists of two components:
- Base Model: pretrained large model shared across all adapters.
- Adapter: task-specific adapter weights dynamically loaded per request.
LoRAX supports a number of Large Language Models as the base model, including Llama (including CodeLlama), Mistral (including Zephyr), and Qwen. See Supported Architectures for a complete list of supported base models.
Base models can be loaded in fp16 or quantized with bitsandbytes, GPT-Q, or AWQ.
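As a hedged illustration, quantization is selected at launch time via the launcher's --quantize flag, using the pre-built Docker image introduced below; the exact set of accepted values depends on your LoRAX version.

```shell
# Sketch: serve the base model quantized with bitsandbytes instead of fp16.
# Run `lorax-launcher --help` to see the quantization backends your version supports.
docker run --gpus all --shm-size 1g -p 8080:80 -v $PWD/data:/data \
    ghcr.io/predibase/lorax:main \
    --model-id mistralai/Mistral-7B-Instruct-v0.1 \
    --quantize bitsandbytes
```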
Supported adapters include LoRA adapters trained using the PEFT and Ludwig libraries. Any of the linear layers in the model can be adapted via LoRA and loaded in LoRAX.
⚙️ Model Compatibility & VRAM: Selecting the right model for your GPU's VRAM is crucial. Not all quantized models are plug-and-play due to varying toolchains. For detailed guidance on VRAM limitations and troubleshooting quantized model errors (e.g., CUDA out of memory, RuntimeError), refer to Phase 2: Deploy LoRAX in the LoRAX Deployment Playbook.
We recommend starting with our pre-built Docker image to avoid compiling custom CUDA kernels and other dependencies.
The minimum system requirements needed to run LoRAX include:
- Nvidia GPU (Ampere generation or above)
- CUDA 11.8-compatible device drivers or above
- Linux OS
- Docker (for this guide)
🚨 Critical Setup Note: Meeting these requirements can be complex. For a step-by-step, verified guide on installing GPU drivers, Docker Engine, and nvidia-container-toolkit (including essential user permissions), please follow Phase 1: Host Setup in the LoRAX Deployment Playbook. Incorrect setup here is the most common cause of deployment failures.
Install nvidia-container-toolkit, then reload and restart Docker:

```shell
sudo systemctl daemon-reload
sudo systemctl restart docker
```
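For reference, on Debian/Ubuntu the toolkit install and Docker runtime configuration typically look like the sketch below; it assumes NVIDIA's apt repository is already configured (see NVIDIA's install guide for other distros).

```shell
# Hedged sketch for Debian/Ubuntu; assumes NVIDIA's apt repository is set up.
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
# Register the NVIDIA runtime with Docker, then restart it (as above).
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
```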
💡 For the most reliable and fully explained docker run command, including critical flags (-e HUGGING_FACE_HUB_TOKEN, --user), model selection based on GPU VRAM, and troubleshooting common issues like model download stalls or quantized model compatibility, refer to our comprehensive guide: Phase 2: Deploy LoRAX and Phase 3: Test the API in the LoRAX Deployment Playbook.
```shell
model=mistralai/Mistral-7B-Instruct-v0.1
volume=$PWD/data

docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data \
    ghcr.io/predibase/lorax:main --model-id $model
```

For a full tutorial including token streaming and the Python client, see Getting Started - Docker.
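For gated models such as Mistral, the container also needs your Hugging Face token. Below is a hedged variant of the command above; it assumes HUGGING_FACE_HUB_TOKEN is already set in your shell.

```shell
# Sketch: forward a Hugging Face token for gated/private models, and run as the
# current user so files downloaded into $volume stay writable on the host.
docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data \
    -e HUGGING_FACE_HUB_TOKEN=$HUGGING_FACE_HUB_TOKEN \
    --user "$(id -u):$(id -g)" \
    ghcr.io/predibase/lorax:main --model-id $model
```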
Prompt the base LLM:
```shell
curl 127.0.0.1:8080/generate \
    -X POST \
    -d '{
        "inputs": "[INST] Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May? [/INST]",
        "parameters": {
            "max_new_tokens": 64
        }
    }' \
    -H 'Content-Type: application/json'
```

Prompt a LoRA adapter:
```shell
curl 127.0.0.1:8080/generate \
    -X POST \
    -d '{
        "inputs": "[INST] Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May? [/INST]",
        "parameters": {
            "max_new_tokens": 64,
            "adapter_id": "vineetsharma/qlora-adapter-Mistral-7B-Instruct-v0.1-gsm8k"
        }
    }' \
    -H 'Content-Type: application/json'
```

See Reference - REST API for full details.
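Token streaming uses the companion /generate_stream endpoint, which returns server-sent events rather than a single JSON body; a minimal sketch:

```shell
# Sketch: stream generated tokens as server-sent events.
curl 127.0.0.1:8080/generate_stream \
    -X POST \
    -d '{
        "inputs": "[INST] Why is the sky blue? [/INST]",
        "parameters": {"max_new_tokens": 64}
    }' \
    -H 'Content-Type: application/json'
```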
Install:

```shell
pip install lorax-client
```

Run:
```python
from lorax import Client

client = Client("http://127.0.0.1:8080")

# Prompt the base LLM
prompt = "[INST] Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May? [/INST]"
print(client.generate(prompt, max_new_tokens=64).generated_text)

# Prompt a LoRA adapter
adapter_id = "vineetsharma/qlora-adapter-Mistral-7B-Instruct-v0.1-gsm8k"
print(client.generate(prompt, max_new_tokens=64, adapter_id=adapter_id).generated_text)
```

See Reference - Python Client for full details.
For other ways to run LoRAX, see Getting Started - Kubernetes, Getting Started - SkyPilot, and Getting Started - Local.
LoRAX supports multi-turn chat conversations combined with dynamic adapter loading through an OpenAI compatible API. Just specify any adapter as the model parameter.
```python
from openai import OpenAI

client = OpenAI(
    api_key="EMPTY",
    base_url="http://127.0.0.1:8080/v1",
)

resp = client.chat.completions.create(
    model="alignment-handbook/zephyr-7b-dpo-lora",
    messages=[
        {
            "role": "system",
            "content": "You are a friendly chatbot who always responds in the style of a pirate",
        },
        {"role": "user", "content": "How many helicopters can a human eat in one sitting?"},
    ],
    max_tokens=100,
)
print("Response:", resp.choices[0].message.content)
```

See OpenAI Compatible API for details.
Here are some other interesting Mistral-7B fine-tuned models to try out:
- alignment-handbook/zephyr-7b-dpo-lora: Mistral-7B fine-tuned on the Zephyr-7B dataset with DPO.
- IlyaGusev/saiga_mistral_7b_lora: Russian chatbot based on Open-Orca/Mistral-7B-OpenOrca.
- Undi95/Mistral-7B-roleplay_alpaca-lora: Fine-tuned using role-play prompts.
You can find more LoRA adapters here, or try fine-tuning your own with PEFT or Ludwig.
LoRAX is built on top of HuggingFace's text-generation-inference, forked from v0.9.4 (Apache 2.0).
We'd also like to acknowledge Punica for their work on the SGMV kernel, which is used to speed up multi-adapter inference under heavy load.
Our roadmap is tracked here.
