feat: Multi-model management with on-demand loading and LRU eviction #191

@sysit

Description

Problem

On Apple Silicon machines with large unified memory (e.g., 128GB+ Mac Studio), running a single model underutilizes available resources. Users with multiple models want to serve them concurrently without spawning multiple server processes on different ports.

Current limitations:

  • vllm-mlx serve only loads one model at startup
  • No way to dynamically load/unload models based on demand
  • Memory sits idle when only one model is loaded
  • Users must manage multiple processes manually for multi-model serving

Proposed Solution

Add a multi-model manager with:

1. Model Registry

  • Configure multiple models via config file or CLI
  • Support model aliases (e.g., "fast" → "Qwen3-8B", "smart" → "Qwen3-72B")

2. On-Demand Loading

  • Load models lazily when first request arrives
  • Support pre-loading (warm start) for critical models
  • Track memory usage per model
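Lazy loading amounts to a get-or-load cache guarded by a lock so concurrent first requests don't load the same weights twice. A sketch, with `load_fn` standing in for the real MLX weight-loading call (hypothetical):

```python
import threading

class LazyLoader:
    """Loads a model the first time it is requested; later calls reuse it."""
    def __init__(self, load_fn):
        self._load_fn = load_fn
        self._models = {}
        self._lock = threading.Lock()

    def get(self, name: str):
        with self._lock:
            if name not in self._models:
                # Cold start: only the first request for `name` pays this cost.
                self._models[name] = self._load_fn(name)
            return self._models[name]

loads = []
loader = LazyLoader(lambda n: loads.append(n) or f"model:{n}")
loader.get("qwen-8b")
loader.get("qwen-8b")
assert loads == ["qwen-8b"]  # loaded exactly once
```

Pre-loading is then just calling `get()` for every `preload: true` entry at startup.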

3. LRU Eviction

  • Evict least-recently-used models when the memory budget is exceeded
  • Configurable max memory budget
  • Graceful eviction: finish pending requests before unloading
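The eviction policy can be sketched with an `OrderedDict` keyed by model name, where lookups move an entry to the most-recently-used end and inserts evict from the least-recently-used end until the budget is met. Sizes are supplied by the caller here; real code would measure MLX memory usage:

```python
from collections import OrderedDict

class LRUModelCache:
    """Evicts least-recently-used models once the memory budget is exceeded."""
    def __init__(self, max_bytes: int):
        self.max_bytes = max_bytes
        self._items = OrderedDict()   # name -> (model, size_bytes)
        self._used = 0

    def put(self, name, model, size):
        self._items[name] = (model, size)
        self._items.move_to_end(name)
        self._used += size
        while self._used > self.max_bytes:
            # popitem(last=False) removes the least-recently-used entry.
            _, (_, sz) = self._items.popitem(last=False)
            self._used -= sz

    def get(self, name):
        self._items.move_to_end(name)  # mark as recently used
        return self._items[name][0]

cache = LRUModelCache(max_bytes=100)
cache.put("a", object(), 60)
cache.put("b", object(), 30)
cache.get("a")                 # touch "a" so "b" becomes the LRU entry
cache.put("c", object(), 40)   # over budget -> "b" is evicted
assert "b" not in cache._items and "a" in cache._items
```

Graceful eviction would additionally wait for a model's in-flight requests to drain before unloading (see the implementation notes below for one way to track that).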

4. API Compatibility

  • GET /v1/models returns all configured models with status (loaded/unloaded)
  • Requests whose model parameter names an unloaded model trigger auto-loading
  • Consistent with OpenAI API behavior
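The listing endpoint could return the standard OpenAI-style list payload with one extra, non-standard `status` field per model. A sketch of the response builder (function name and field are proposals, not existing API):

```python
def models_payload(model_names, loaded: set) -> dict:
    """Builds an OpenAI-style /v1/models response.
    The `status` field is a proposed, non-standard addition."""
    return {
        "object": "list",
        "data": [
            {
                "id": name,
                "object": "model",
                "status": "loaded" if name in loaded else "unloaded",
            }
            for name in model_names
        ],
    }

payload = models_payload(["qwen-8b", "llama-70b"], loaded={"qwen-8b"})
assert payload["data"][0]["status"] == "loaded"
assert payload["data"][1]["status"] == "unloaded"
```

Clients that ignore unknown fields (as the OpenAI SDKs do) would be unaffected by the extra key.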

Example Usage

# Config file: models.yaml
models:
  - name: qwen-35b
    path: ~/models/Qwen3.5-35B-A3B
    alias: smart
    preload: true
  - name: qwen-8b
    path: ~/models/Qwen3-8B
    alias: fast
  - name: llama-70b
    path: mlx-community/Llama-3.3-70B-Instruct-4bit

# Start server with multi-model support
vllm-mlx serve --config models.yaml --max-memory 100GB

Benefits

  • Efficiency: Utilize large unified memory effectively
  • Simplicity: Single server process, single port
  • Compatibility: Works with existing OpenAI-compatible clients
  • Cost-effective: No need for multiple GPU instances

Implementation Notes

This would require:

  1. A ModelManager class to track loaded models
  2. Memory tracking and eviction logic
  3. Request routing to appropriate model
  4. Graceful loading/unloading with request queuing
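For step 4, one way to make unloading graceful is to reference-count in-flight requests per model and have the evictor wait for the count to hit zero. A sketch under that assumption (names are hypothetical):

```python
import threading

class RefCountedModel:
    """Tracks in-flight requests so eviction can wait for them to finish."""
    def __init__(self, model):
        self.model = model
        self._inflight = 0
        self._cond = threading.Condition()

    def acquire(self):
        """Called when a request starts using this model."""
        with self._cond:
            self._inflight += 1

    def release(self):
        """Called when a request finishes; wakes any waiting evictor."""
        with self._cond:
            self._inflight -= 1
            self._cond.notify_all()

    def wait_idle(self, timeout=None) -> bool:
        """Blocks until no requests are in flight; returns False on timeout."""
        with self._cond:
            return self._cond.wait_for(lambda: self._inflight == 0, timeout)

m = RefCountedModel(object())
m.acquire()
m.release()
assert m.wait_idle(timeout=1)  # no in-flight requests -> safe to unload
```

New requests for a model marked for eviction would either be queued behind a reload or rejected, depending on the policy chosen.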

Happy to contribute if this aligns with the project roadmap!
