feat: Multi-model management with on-demand loading and LRU eviction #191

@sysit

Description

Problem

On Apple Silicon machines with large unified memory (e.g., 128GB+ Mac Studio), running a single model underutilizes available resources. Users with multiple models want to serve them concurrently without spawning multiple server processes on different ports.

Current limitations:

  • vllm-mlx serve only loads one model at startup
  • No way to dynamically load/unload models based on demand
  • Memory sits idle when only one model is loaded
  • Users must manage multiple processes manually for multi-model serving

Proposed Solution

Add a multi-model manager with:

1. Model Registry

  • Configure multiple models via config file or CLI
  • Support model aliases (e.g., "fast" → "Qwen3-8B", "smart" → "Qwen3-72B")

2. On-Demand Loading

  • Load models lazily when first request arrives
  • Support pre-loading (warm start) for critical models
  • Track memory usage per model
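Lazy loading amounts to a get-or-load cache guarded by a lock so concurrent first requests don't load the same weights twice. A sketch, with `load_fn` standing in for the real MLX weight-loading call (hypothetical):

```python
import threading

class LazyLoader:
    """Loads a model the first time it is requested; later calls reuse it."""
    def __init__(self, load_fn):
        self._load_fn = load_fn
        self._models = {}
        self._lock = threading.Lock()

    def get(self, name: str):
        with self._lock:
            if name not in self._models:
                # Cold start: only the first request for `name` pays this cost.
                self._models[name] = self._load_fn(name)
            return self._models[name]

loads = []
loader = LazyLoader(lambda n: loads.append(n) or f"model:{n}")
loader.get("qwen-8b")
loader.get("qwen-8b")
assert loads == ["qwen-8b"]  # loaded exactly once
```

Pre-loading is then just calling `get()` for every `preload: true` entry at startup.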

3. LRU Eviction

  • Evict least-recently-used models when the memory budget is exceeded
  • Configurable max memory budget
  • Graceful eviction: finish pending requests before unloading
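The eviction policy can be sketched with an `OrderedDict` keyed by model name, where lookups move an entry to the most-recently-used end and inserts evict from the least-recently-used end until the budget is met. Sizes are supplied by the caller here; real code would measure MLX memory usage:

```python
from collections import OrderedDict

class LRUModelCache:
    """Evicts least-recently-used models once the memory budget is exceeded."""
    def __init__(self, max_bytes: int):
        self.max_bytes = max_bytes
        self._items = OrderedDict()   # name -> (model, size_bytes)
        self._used = 0

    def put(self, name, model, size):
        self._items[name] = (model, size)
        self._items.move_to_end(name)
        self._used += size
        while self._used > self.max_bytes:
            # popitem(last=False) removes the least-recently-used entry.
            _, (_, sz) = self._items.popitem(last=False)
            self._used -= sz

    def get(self, name):
        self._items.move_to_end(name)  # mark as recently used
        return self._items[name][0]

cache = LRUModelCache(max_bytes=100)
cache.put("a", object(), 60)
cache.put("b", object(), 30)
cache.get("a")                 # touch "a" so "b" becomes the LRU entry
cache.put("c", object(), 40)   # over budget -> "b" is evicted
assert "b" not in cache._items and "a" in cache._items
```

Graceful eviction would additionally wait for a model's in-flight requests to drain before unloading (see the implementation notes below for one way to track that).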

4. API Compatibility

  • GET /v1/models returns all configured models with status (loaded/unloaded)
  • Requests whose model parameter names an unloaded model trigger auto-loading
  • Consistent with OpenAI API behavior
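The listing endpoint could return the standard OpenAI-style list payload with one extra, non-standard `status` field per model. A sketch of the response builder (function name and field are proposals, not existing API):

```python
def models_payload(model_names, loaded: set) -> dict:
    """Builds an OpenAI-style /v1/models response.
    The `status` field is a proposed, non-standard addition."""
    return {
        "object": "list",
        "data": [
            {
                "id": name,
                "object": "model",
                "status": "loaded" if name in loaded else "unloaded",
            }
            for name in model_names
        ],
    }

payload = models_payload(["qwen-8b", "llama-70b"], loaded={"qwen-8b"})
assert payload["data"][0]["status"] == "loaded"
assert payload["data"][1]["status"] == "unloaded"
```

Clients that ignore unknown fields (as the OpenAI SDKs do) would be unaffected by the extra key.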

Example Usage

# Config file: models.yaml
models:
  - name: qwen-35b
    path: ~/models/Qwen3.5-35B-A3B
    alias: smart
    preload: true
  - name: qwen-8b
    path: ~/models/Qwen3-8B
    alias: fast
  - name: llama-70b
    path: mlx-community/Llama-3.3-70B-Instruct-4bit

# Start server with multi-model support
vllm-mlx serve --config models.yaml --max-memory 100GB

Benefits

  • Efficiency: Utilize large unified memory effectively
  • Simplicity: Single server process, single port
  • Compatibility: Works with existing OpenAI-compatible clients
  • Cost-effective: No need for multiple GPU instances

Implementation Notes

This would require:

  1. A ModelManager class to track loaded models
  2. Memory tracking and eviction logic
  3. Request routing to appropriate model
  4. Graceful loading/unloading with request queuing
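For step 4, one way to make unloading graceful is to reference-count in-flight requests per model and have the evictor wait for the count to hit zero. A sketch under that assumption (names are hypothetical):

```python
import threading

class RefCountedModel:
    """Tracks in-flight requests so eviction can wait for them to finish."""
    def __init__(self, model):
        self.model = model
        self._inflight = 0
        self._cond = threading.Condition()

    def acquire(self):
        """Called when a request starts using this model."""
        with self._cond:
            self._inflight += 1

    def release(self):
        """Called when a request finishes; wakes any waiting evictor."""
        with self._cond:
            self._inflight -= 1
            self._cond.notify_all()

    def wait_idle(self, timeout=None) -> bool:
        """Blocks until no requests are in flight; returns False on timeout."""
        with self._cond:
            return self._cond.wait_for(lambda: self._inflight == 0, timeout)

m = RefCountedModel(object())
m.acquire()
m.release()
assert m.wait_idle(timeout=1)  # no in-flight requests -> safe to unload
```

New requests for a model marked for eviction would either be queued behind a reload or rejected, depending on the policy chosen.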

Happy to contribute if this aligns with the project roadmap!
