Problem
On Apple Silicon machines with large unified memory (e.g., 128GB+ Mac Studio), running a single model underutilizes available resources. Users with multiple models want to serve them concurrently without spawning multiple server processes on different ports.
Current limitations:
- `vllm-mlx serve` only loads one model at startup
- No way to dynamically load/unload models based on demand
- Memory sits idle when only one model is loaded
- Users must manage multiple processes manually for multi-model serving
Proposed Solution
Add a multi-model manager with:
1. Model Registry
- Configure multiple models via config file or CLI
- Support model aliases (e.g., "fast" → "Qwen3-8B", "smart" → "Qwen3-72B")
2. On-Demand Loading
- Load models lazily when first request arrives
- Support pre-loading (warm start) for critical models
- Track memory usage per model
3. LRU Eviction
- Evict least-recently-used models when memory threshold exceeded
- Configurable max memory budget
- Graceful eviction: finish pending requests before unloading
4. API Compatibility
- `GET /v1/models` returns all configured models with status (loaded/unloaded)
- Requests with a `model` parameter trigger auto-loading if not in memory
- Consistent with OpenAI API behavior
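To make the listing concrete, here is a minimal sketch of the extended `/v1/models` response. The `status` field is the proposed addition; the rest follows the OpenAI list format. The function name and signature are illustrative, not existing vllm-mlx code.

```python
def models_payload(registry, loaded):
    """Build an OpenAI-style model list, extended with a proposed
    `status` field reflecting whether each model is in memory.

    registry: iterable of configured model names
    loaded: set of names currently loaded in memory
    """
    return {
        "object": "list",
        "data": [
            {"id": name, "object": "model",
             "status": "loaded" if name in loaded else "unloaded"}
            for name in registry
        ],
    }
```

Clients that only read `id` and `object` keep working; clients aware of the extension can show load state.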
Example Usage
```yaml
# Config file: models.yaml
models:
  - name: qwen-35b
    path: ~/models/Qwen3.5-35B-A3B
    alias: smart
    preload: true
  - name: qwen-8b
    path: ~/models/Qwen3-8B
    alias: fast
  - name: llama-70b
    path: mlx-community/Llama-3.3-70B-Instruct-4bit
```

```shell
# Start server with multi-model support
vllm-mlx serve --config models.yaml --max-memory 100GB
```

Benefits
- Efficiency: Utilize large unified memory effectively
- Simplicity: Single server process, single port
- Compatibility: Works with existing OpenAI-compatible clients
- Cost-effective: No need for multiple GPU instances
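Putting the pieces together, a hypothetical end-to-end session could look like this. The flags, endpoints, auto-loading behavior, and the port are all as proposed above, not current vllm-mlx functionality.

```shell
# Start one server process for all configured models
vllm-mlx serve --config models.yaml --max-memory 100GB

# List configured models and their load status
curl http://localhost:8000/v1/models

# First request for the "fast" alias triggers a lazy load of Qwen3-8B
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "fast", "messages": [{"role": "user", "content": "Hello"}]}'
```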
Related
- #102 (vllm-mlx serve support for model switching): this proposal extends that to full multi-model management
Implementation Notes
This would require:
- A `ModelManager` class to track loaded models
- Memory tracking and eviction logic
- Request routing to the appropriate model
- Graceful loading/unloading with request queuing
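The notes above can be sketched in one place. This is a minimal, hypothetical `ModelManager` (names, signatures, and the GB-based accounting are all illustrative, not existing vllm-mlx APIs) showing alias resolution, lazy loading on first use, and LRU eviction against a memory budget; request queuing and graceful draining are left as comments.

```python
from collections import OrderedDict

class ModelManager:
    """Sketch of the proposed manager: resolves aliases, loads models
    lazily on first request, tracks per-model memory, and evicts
    least-recently-used models once the budget is exceeded."""

    def __init__(self, registry, loader, max_gb):
        self.registry = registry          # name -> config dict (path, alias, ...)
        self.aliases = {cfg["alias"]: name
                        for name, cfg in registry.items() if "alias" in cfg}
        self.loader = loader              # callable: config -> (model, size_gb)
        self.max_gb = max_gb
        self.loaded = OrderedDict()       # name -> (model, size_gb), oldest first

    def get(self, name):
        """Return the model for a name or alias, loading/evicting as needed."""
        name = self.aliases.get(name, name)
        if name in self.loaded:
            self.loaded.move_to_end(name)        # mark as recently used
            return self.loaded[name][0]
        if name not in self.registry:
            raise KeyError(f"unknown model: {name}")
        model, size_gb = self.loader(self.registry[name])
        # Evict LRU models until the new model fits the budget.
        while self.loaded and self._used_gb() + size_gb > self.max_gb:
            self.loaded.popitem(last=False)
            # A real server would drain pending requests before unloading.
        self.loaded[name] = (model, size_gb)
        return model

    def _used_gb(self):
        return sum(size for _, size in self.loaded.values())
```

A fake loader is enough to exercise the logic: with a 60 GB budget and two 40 GB models, requesting the second model evicts the first.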
Happy to contribute if this aligns with the project roadmap!