remyjkim/local-mlx-server

LLM Local Server with MLX and Open WebUI

This project sets up a local environment to run Large Language Models (LLMs) using the MLX framework and interact with them through the Open WebUI interface.

Prerequisites

  • An Apple Silicon Mac (the MLX framework runs only on Apple Silicon)
  • Docker, to run the Open WebUI container
  • Flox (installed in step 1 of the Setup below)

Setup

  1. Install Flox: If you haven't already, install Flox by following the official installation instructions.

  2. Activate Flox Environment: Navigate to the project directory (llm-local-server) in your terminal and activate the Flox environment. This will install uv and other necessary tools defined in the flox.nix file (if one exists).

    cd llm-local-server
    flox activate
  3. Install Python Dependencies: Use uv to sync the Python dependencies listed in requirements.txt or pyproject.toml.

    uv sync

    (Note: Make sure you have a requirements.txt or pyproject.toml file with mlx-lm listed).

  4. Make the Script Executable: Grant execution permissions to the run script.

    chmod +x run_mlx_openwebui.sh
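
If the project does not yet include a dependency file, a minimal pyproject.toml along these lines is enough for step 3 (the project name and Python bound are placeholders; mlx-lm is the only hard requirement):

```toml
[project]
name = "llm-local-server"        # placeholder name
version = "0.1.0"
requires-python = ">=3.9"        # assumed lower bound
dependencies = [
    "mlx-lm",                    # provides the mlx_lm.server command used below
]
```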

Running the Server and UI

  1. Execute the Run Script: Run the script, providing the name of the MLX model you want to load as an argument. Replace <your_model_name> with the actual model identifier (e.g., mlx-community/Mistral-7B-Instruct-v0.2).

    ./run_mlx_openwebui.sh <your_model_name>

    This script will:

    • Start the Open WebUI container in the background using docker compose. The docker-compose.yml file maps host port 3000 to the container's port 8080 and uses host.docker.internal to allow the container to communicate back to the MLX server running on your host machine.
    • Open a new terminal window and start the MLX LM server, loading the specified model and listening on port 8000.
  2. Access Open WebUI: Open your web browser and navigate to http://localhost:3000 (as mapped in the docker-compose.yml).

  3. Configure Open WebUI:

    • In Open WebUI, go to Settings -> Connections.
    • Set the API Base URL to http://host.docker.internal:8000/v1. (Open WebUI, running inside Docker, will use this URL to connect to the MLX server running on your host machine thanks to the extra_hosts setting in docker-compose.yml).
    • You should now be able to select and interact with your locally running model.
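
For reference, a docker-compose.yml matching the behaviour described above could look like the sketch below; the service name and volume are assumptions, while the port mapping and extra_hosts entry are the ones this guide relies on:

```yaml
services:
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    ports:
      - "3000:8080"    # host port 3000 -> container port 8080 (http://localhost:3000)
    extra_hosts:
      # Lets the container reach the MLX server on the host as host.docker.internal
      - "host.docker.internal:host-gateway"
    volumes:
      - open-webui:/app/backend/data    # persist Open WebUI settings and chats

volumes:
  open-webui:
```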

Available MLX Models for MacBook Pro 64GB RAM

This comprehensive list includes models optimized for Apple Silicon using the MLX framework. All models are compatible with mlx-lm and have been tested on MacBook Pro systems with 64GB unified memory.

Memory Requirements Guide

  • Small Models (1-8B): 3-10 GB RAM (4-bit quantized)
  • Medium Models (8-14B): 10-20 GB RAM (4-bit quantized)
  • Large Models (14-32B): 20-35 GB RAM (4-bit quantized)
  • Extra Large Models (32-70B): 35-50 GB RAM (4-bit quantized)

With 64GB RAM, you can run models up to 70B parameters (4-bit quantized) or run multiple smaller models simultaneously.

πŸ† Recommended Models for 64GB Systems

Best Overall Performance

  • mlx-community/Llama-3.3-70B-Instruct-4bit (40GB) - GPT-4 class performance
  • mlx-community/DeepSeek-R1-Distill-Qwen-32B-4bit (18.5GB) - Excellent reasoning capabilities

Best Coding Models

  • mlx-community/Qwen2.5-Coder-32B-Instruct-4bit (~18GB) - Top coding performance, 92 languages
  • mlx-community/DeepSeek-Coder-V2-Instruct-4bit - Specialized for code generation
  • mlx-community/Qwen3-Coder-Flash (30.5B MoE) - Fast coding with MoE architecture

Best Balanced Models (Quality + Speed)

  • lmstudio-community/Mistral-Small-3.2-24B-Instruct-2506-MLX-4bit (~12GB) - GPT-4 quality
  • mlx-community/Qwen3-8B-4bit - Excellent general performance, compact size
  • mlx-community/Meta-Llama-3.1-8B-Instruct-4bit - Reliable and fast

πŸ“š Complete Model Catalog by Family

Llama Family

Llama 3.3 (70B)

  • mlx-community/Llama-3.3-70B-Instruct-4bit (40GB)

Llama 3.2 (Vision-capable)

  • mlx-community/Llama-3.2-11B-Vision-Instruct-4bit (~7GB) - Multimodal
  • mlx-community/Llama-3.2-90B-Vision-Instruct-4bit (~50GB) - Large multimodal

Llama 3.1

  • mlx-community/Meta-Llama-3.1-8B-Instruct-4bit (~5GB)
  • mlx-community/Meta-Llama-3.1-70B-Instruct-4bit (~40GB)

Llama 3

  • mlx-community/Meta-Llama-3-8B-Instruct-4bit (~5GB)
  • mlx-community/Meta-Llama-3-70B-Instruct-4bit (~40GB)

Qwen Family

Qwen 3 (Latest)

  • mlx-community/Qwen3-235B-A22B-8bit - Massive MoE model
  • mlx-community/Qwen3-8B-4bit (~5GB)
  • mlx-community/Qwen3-4B-4bit (~3GB)
  • mlx-community/Qwen3-1.5B-4bit (~2GB)

Qwen 3 Coder

  • mlx-community/Qwen3-Coder-Flash (30.5B MoE, 3.3B active)
  • mlx-community/Qwen3-Coder-32B-4bit

Qwen 2.5

  • mlx-community/Qwen2.5-72B-Instruct-4bit (~40GB)
  • mlx-community/Qwen2.5-32B-Instruct-4bit (~18GB)
  • mlx-community/Qwen2.5-14B-Instruct-4bit (~8GB)
  • mlx-community/Qwen2.5-7B-Instruct-4bit (~4GB)

Qwen 2.5 Coder (Specialized for Coding)

  • mlx-community/Qwen2.5-Coder-32B-Instruct-4bit (~18GB)
  • mlx-community/Qwen2.5-Coder-14B-Instruct-4bit (~8GB)
  • mlx-community/Qwen2.5-Coder-7B-Instruct-4bit (~4GB)

Qwen 2.5 Vision

  • mlx-community/Qwen2.5-VL-3B-Instruct-4bit - Multimodal capabilities

QwQ (Reasoning)

  • mlx-community/QwQ-32B-Preview-4bit (18.5GB, 32K context)

DeepSeek Family

DeepSeek R1 (Reasoning Models)

  • lmstudio-community/DeepSeek-R1-0528-Qwen3-8B-MLX-4bit
  • mlx-community/DeepSeek-R1-Distill-Qwen-32B-4bit (18.5GB)
  • mlx-community/DeepSeek-R1-Distill-Qwen-14B-4bit
  • mlx-community/DeepSeek-R1-Distill-Qwen-1.5B-4bit (both 4-bit and 8-bit)

DeepSeek V3

  • mlx-community/DeepSeek-V3.1-4bit

DeepSeek Coder

  • mlx-community/DeepSeek-Coder-V2-Instruct-4bit
  • Models range from 1.3B to 33B parameters

Mistral Family

Mistral Small

  • mlx-community/Mistral-Small-24B-Instruct-2501-4bit (~12GB)
  • lmstudio-community/Mistral-Small-3.2-24B-Instruct-2506-MLX-4bit
  • lmstudio-community/Mistral-Small-3.2-24B-Instruct-2506-MLX-8bit
  • mlx-community/Mistral-Small-Instruct-2409-4bit

Mistral 7B

  • mlx-community/Mistral-7B-Instruct-v0.3-4bit (Default in run_mlx_openwebui.sh)
  • mlx-community/Mistral-7B-Instruct-v0.2-4-bit
  • mlx-community/Mistral-7B-v0.1-hf-4bit-mlx

Pixtral (Vision)

  • mlx-community/pixtral-12b-4bit (7.15GB) - Multimodal, 128K context

Phi Family (Microsoft)

Phi-4 (14B - Latest)

  • mlx-community/phi-4-4bit (~8GB)
  • mlx-community/phi-4-8bit (~14GB)
  • lmstudio-community/Phi-4-reasoning-MLX-4bit
  • lmstudio-community/Phi-4-mini-reasoning-MLX-4bit

Phi-3

  • mlx-community/Phi-3-mini-4k-instruct-4bit (~2GB)
  • mlx-community/Phi-3-mini-4k-instruct-8bit (~4GB)

Gemma Family (Google)

Gemma 3

  • mlx-community/gemma-3n-E4B-it-lm-4bit (MoE model)
  • lmstudio-community/gemma-3n-E4B-it-MLX-4bit
  • lmstudio-community/gemma-3n-E4B-it-MLX-8bit
  • mlx-community/gemma-3-4b-it-8bit
  • mlx-community/gemma-3-4b-pt-4bit
  • mlx-community/gemma-3-1b-it-4bit

Other Notable Models

StableLM

  • mlx-community/stablelm-2-zephyr-1_6b-4bit (~1GB)

H2O Danube

  • ucheog/h2o-danube2-1.8b-chat-MLX-4bit (~1GB)

DBRX

  • Quantized versions available (requires 64GB+ RAM)

🎨 Multimodal (Vision) Models

These models can process both images and text:

  • mlx-community/Llama-3.2-11B-Vision-Instruct-4bit (~7GB)
  • mlx-community/Llama-3.2-90B-Vision-Instruct-4bit (~50GB)
  • mlx-community/pixtral-12b-4bit (7.15GB) - Mistral's vision model
  • mlx-community/Qwen2.5-VL-3B-Instruct-4bit - Qwen vision variant

πŸ“Š Understanding Quantization

4-bit quantization: Best balance of quality and size (recommended for 64GB systems)

  • Roughly 0.5 GB of weights per billion parameters (~3.5-4 GB for a 7B model)
  • Minimal quality loss for most tasks

8-bit quantization: Higher quality, larger size

  • Roughly 1 GB of weights per billion parameters (~7 GB for a 7B model)
  • Better for tasks requiring high precision

BF16/FP16: Full precision

  • Maximum quality but 4x larger than 4-bit
  • Only for smaller models on 64GB systems
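
The sizes above follow directly from parameters × bits per weight / 8, plus runtime overhead. A rough sketch (the 15% overhead factor is an assumption, not a measured value):

```shell
# Rough RAM estimate in GB: billions of params * bits per weight / 8,
# plus ~15% for the KV cache and runtime overhead (rough assumption).
estimate_gb() {
  awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b / 8 * 1.15 }'
}

estimate_gb 7 4    # 7B at 4-bit  -> about 4 GB
estimate_gb 32 4   # 32B at 4-bit -> about 18 GB
estimate_gb 70 4   # 70B at 4-bit -> about 40 GB
```

These line up with the catalog sizes above (~4 GB for a 7B, ~18 GB for a 32B, 40 GB for a 70B at 4-bit).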

πŸ” Finding More Models

Official Sources:

  • Most models above are published by the mlx-community and lmstudio-community organizations on Hugging Face; browse their model collections by family.

Tips:

  • Look for -4bit or -8bit in model names for quantized versions
  • Models with -MLX suffix are optimized for Apple Silicon
  • Check model cards for memory requirements and benchmarks
  • MoE (Mixture of Experts) models can run larger parameter counts efficiently
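
The naming conventions above can be checked mechanically; quant_of here is a hypothetical helper, not part of this repository:

```shell
# Hypothetical helper: infer the quantization level from a model name suffix.
quant_of() {
  case "$1" in
    *-4bit*|*-4-bit*) echo "4-bit" ;;
    *-8bit*|*-8-bit*) echo "8-bit" ;;
    *)                echo "unknown (check the model card)" ;;
  esac
}

quant_of "mlx-community/Qwen2.5-Coder-32B-Instruct-4bit"                    # 4-bit
quant_of "lmstudio-community/Mistral-Small-3.2-24B-Instruct-2506-MLX-8bit"  # 8-bit
```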

⚑ Performance Tips for 64GB Systems

  1. Use 4-bit quantization for the best balance of performance and quality
  2. For coding tasks: Qwen2.5-Coder-32B-4bit offers the best results
  3. For reasoning: DeepSeek-R1-Distill-Qwen-32B-4bit or QwQ-32B-Preview-4bit
  4. For general use: Llama-3.3-70B-Instruct-4bit (if you have the memory headroom) or Mistral-Small-24B
  5. Run multiple models: You can run 2-3 smaller models (7B-8B) simultaneously
  6. Use a persistent chat session (e.g. mlx_lm.chat) with larger models to keep them loaded in memory between conversations
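
Tying these tips back to the run script, a hypothetical helper (not part of this repository) that maps a task to the model suggested above:

```shell
# Hypothetical helper: pick one of the models recommended above by task.
model_for() {
  case "$1" in
    coding)    echo "mlx-community/Qwen2.5-Coder-32B-Instruct-4bit" ;;
    reasoning) echo "mlx-community/DeepSeek-R1-Distill-Qwen-32B-4bit" ;;
    general)   echo "mlx-community/Llama-3.3-70B-Instruct-4bit" ;;
    *)         echo "mlx-community/Meta-Llama-3.1-8B-Instruct-4bit" ;;
  esac
}

model_for coding    # prints mlx-community/Qwen2.5-Coder-32B-Instruct-4bit
# Then launch it:
# ./run_mlx_openwebui.sh "$(model_for coding)"
```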

Stopping Services

  1. Stop MLX Server: Close the terminal window where the uv run mlx_lm.server... command is running, or press Ctrl+C in that window.

  2. Stop Open WebUI: Run the following command in the project directory (using the modern docker compose syntax):

    docker compose down
