remyjkim/local-mlx-server

LLM Local Server with MLX and Open WebUI

This project sets up a local environment to run Large Language Models (LLMs) using the MLX framework and interact with them through the Open WebUI interface.

Prerequisites

  • An Apple Silicon Mac (the MLX framework runs only on Apple Silicon)
  • Docker, to run the Open WebUI container
  • Flox (installed in step 1 of the Setup below)

Setup

  1. Install Flox: If you haven't already, install Flox by following the official installation instructions.

  2. Activate Flox Environment: Navigate to the project directory (llm-local-server) in your terminal and activate the Flox environment. This will install uv and other necessary tools defined in the flox.nix file (if one exists).

    cd llm-local-server
    flox activate
  3. Install Python Dependencies: Use uv to sync the Python dependencies listed in requirements.txt or pyproject.toml.

    uv sync

    (Note: Make sure you have a requirements.txt or pyproject.toml file with mlx-lm listed).

  4. Make the Script Executable: Grant execution permissions to the run script.

    chmod +x run_mlx_openwebui.sh
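
If the project does not yet include a dependency file, a minimal pyproject.toml along these lines is enough for step 3 (the project name and Python bound are placeholders; mlx-lm is the only hard requirement):

```toml
[project]
name = "llm-local-server"        # placeholder name
version = "0.1.0"
requires-python = ">=3.9"        # assumed lower bound
dependencies = [
    "mlx-lm",                    # provides the mlx_lm.server command used below
]
```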

Running the Server and UI

  1. Execute the Run Script: Run the script, providing the name of the MLX model you want to load as an argument. Replace <your_model_name> with the actual model identifier (e.g., mlx-community/Mistral-7B-Instruct-v0.2).

    ./run_mlx_openwebui.sh <your_model_name>

    This script will:

    • Start the Open WebUI container in the background using docker compose. The docker-compose.yml file maps host port 3000 to the container's port 8080 and uses host.docker.internal to allow the container to communicate back to the MLX server running on your host machine.
    • Open a new terminal window and start the MLX LM server, loading the specified model and listening on port 8000.
  2. Access Open WebUI: Open your web browser and navigate to http://localhost:3000 (as mapped in the docker-compose.yml).

  3. Configure Open WebUI:

    • In Open WebUI, go to Settings -> Connections.
    • Set the API Base URL to http://host.docker.internal:8000/v1. (Open WebUI, running inside Docker, will use this URL to connect to the MLX server running on your host machine thanks to the extra_hosts setting in docker-compose.yml).
    • You should now be able to select and interact with your locally running model.
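
For reference, a docker-compose.yml matching the behaviour described above could look like the sketch below; the service name and volume are assumptions, while the port mapping and extra_hosts entry are the ones this guide relies on:

```yaml
services:
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    ports:
      - "3000:8080"    # host port 3000 -> container port 8080 (http://localhost:3000)
    extra_hosts:
      # Lets the container reach the MLX server on the host as host.docker.internal
      - "host.docker.internal:host-gateway"
    volumes:
      - open-webui:/app/backend/data    # persist Open WebUI settings and chats

volumes:
  open-webui:
```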

Available MLX Models for MacBook Pro 64GB RAM

This comprehensive list includes models optimized for Apple Silicon using the MLX framework. All models are compatible with mlx-lm and have been tested on MacBook Pro systems with 64GB unified memory.

Memory Requirements Guide

  • Small Models (1-8B): 3-10 GB RAM (4-bit quantized)
  • Medium Models (8-14B): 10-20 GB RAM (4-bit quantized)
  • Large Models (14-32B): 20-35 GB RAM (4-bit quantized)
  • Extra Large Models (32-70B): 35-50 GB RAM (4-bit quantized)

With 64GB RAM, you can run models up to 70B parameters (4-bit quantized) or run multiple smaller models simultaneously.

πŸ† Recommended Models for 64GB Systems

Best Overall Performance

  • mlx-community/Llama-3.3-70B-Instruct-4bit (40GB) - GPT-4 class performance
  • mlx-community/DeepSeek-R1-Distill-Qwen-32B-4bit (18.5GB) - Excellent reasoning capabilities

Best Coding Models

  • mlx-community/Qwen2.5-Coder-32B-Instruct-4bit (~18GB) - Top coding performance, 92 languages
  • mlx-community/DeepSeek-Coder-V2-Instruct-4bit - Specialized for code generation
  • mlx-community/Qwen3-Coder-Flash (30.5B MoE) - Fast coding with MoE architecture

Best Balanced Models (Quality + Speed)

  • lmstudio-community/Mistral-Small-3.2-24B-Instruct-2506-MLX-4bit (~12GB) - GPT-4 quality
  • mlx-community/Qwen3-8B-4bit - Excellent general performance, compact size
  • mlx-community/Meta-Llama-3.1-8B-Instruct-4bit - Reliable and fast

πŸ“š Complete Model Catalog by Family

Llama Family

Llama 3.3 (70B)

  • mlx-community/Llama-3.3-70B-Instruct-4bit (40GB)

Llama 3.2 (Vision-capable)

  • mlx-community/Llama-3.2-11B-Vision-Instruct-4bit (~7GB) - Multimodal
  • mlx-community/Llama-3.2-90B-Vision-Instruct-4bit (~50GB) - Large multimodal

Llama 3.1

  • mlx-community/Meta-Llama-3.1-8B-Instruct-4bit (~5GB)
  • mlx-community/Meta-Llama-3.1-70B-Instruct-4bit (~40GB)

Llama 3

  • mlx-community/Meta-Llama-3-8B-Instruct-4bit (~5GB)
  • mlx-community/Meta-Llama-3-70B-Instruct-4bit (~40GB)

Qwen Family

Qwen 3 (Latest)

  • mlx-community/Qwen3-235B-A22B-8bit - Massive MoE model
  • mlx-community/Qwen3-8B-4bit (~5GB)
  • mlx-community/Qwen3-4B-4bit (~3GB)
  • mlx-community/Qwen3-1.5B-4bit (~2GB)

Qwen 3 Coder

  • mlx-community/Qwen3-Coder-Flash (30.5B MoE, 3.3B active)
  • mlx-community/Qwen3-Coder-32B-4bit

Qwen 2.5

  • mlx-community/Qwen2.5-72B-Instruct-4bit (~40GB)
  • mlx-community/Qwen2.5-32B-Instruct-4bit (~18GB)
  • mlx-community/Qwen2.5-14B-Instruct-4bit (~8GB)
  • mlx-community/Qwen2.5-7B-Instruct-4bit (~4GB)

Qwen 2.5 Coder (Specialized for Coding)

  • mlx-community/Qwen2.5-Coder-32B-Instruct-4bit (~18GB)
  • mlx-community/Qwen2.5-Coder-14B-Instruct-4bit (~8GB)
  • mlx-community/Qwen2.5-Coder-7B-Instruct-4bit (~4GB)

Qwen 2.5 Vision

  • mlx-community/Qwen2.5-VL-3B-Instruct-4bit - Multimodal capabilities

QwQ (Reasoning)

  • mlx-community/QwQ-32B-Preview-4bit (18.5GB, 32K context)

DeepSeek Family

DeepSeek R1 (Reasoning Models)

  • lmstudio-community/DeepSeek-R1-0528-Qwen3-8B-MLX-4bit
  • mlx-community/DeepSeek-R1-Distill-Qwen-32B-4bit (18.5GB)
  • mlx-community/DeepSeek-R1-Distill-Qwen-14B-4bit
  • mlx-community/DeepSeek-R1-Distill-Qwen-1.5B-4bit (both 4-bit and 8-bit)

DeepSeek V3

  • mlx-community/DeepSeek-V3.1-4bit

DeepSeek Coder

  • mlx-community/DeepSeek-Coder-V2-Instruct-4bit
  • Models range from 1.3B to 33B parameters

Mistral Family

Mistral Small

  • mlx-community/Mistral-Small-24B-Instruct-2501-4bit (~12GB)
  • lmstudio-community/Mistral-Small-3.2-24B-Instruct-2506-MLX-4bit
  • lmstudio-community/Mistral-Small-3.2-24B-Instruct-2506-MLX-8bit
  • mlx-community/Mistral-Small-Instruct-2409-4bit

Mistral 7B

  • mlx-community/Mistral-7B-Instruct-v0.3-4bit (Default in run_mlx_openwebui.sh)
  • mlx-community/Mistral-7B-Instruct-v0.2-4-bit
  • mlx-community/Mistral-7B-v0.1-hf-4bit-mlx

Pixtral (Vision)

  • mlx-community/pixtral-12b-4bit (7.15GB) - Multimodal, 128K context

Phi Family (Microsoft)

Phi-4 (14B - Latest)

  • mlx-community/phi-4-4bit (~8GB)
  • mlx-community/phi-4-8bit (~14GB)
  • lmstudio-community/Phi-4-reasoning-MLX-4bit
  • lmstudio-community/Phi-4-mini-reasoning-MLX-4bit

Phi-3

  • mlx-community/Phi-3-mini-4k-instruct-4bit (~2GB)
  • mlx-community/Phi-3-mini-4k-instruct-8bit (~4GB)

Gemma Family (Google)

Gemma 3

  • mlx-community/gemma-3n-E4B-it-lm-4bit (MoE model)
  • lmstudio-community/gemma-3n-E4B-it-MLX-4bit
  • lmstudio-community/gemma-3n-E4B-it-MLX-8bit
  • mlx-community/gemma-3-4b-it-8bit
  • mlx-community/gemma-3-4b-pt-4bit
  • mlx-community/gemma-3-1b-it-4bit

Other Notable Models

StableLM

  • mlx-community/stablelm-2-zephyr-1_6b-4bit (~1GB)

H2O Danube

  • ucheog/h2o-danube2-1.8b-chat-MLX-4bit (~1GB)

DBRX

  • Quantized versions available (requires 64GB+ RAM)

🎨 Multimodal (Vision) Models

These models can process both images and text:

  • mlx-community/Llama-3.2-11B-Vision-Instruct-4bit (~7GB)
  • mlx-community/Llama-3.2-90B-Vision-Instruct-4bit (~50GB)
  • mlx-community/pixtral-12b-4bit (7.15GB) - Mistral's vision model
  • mlx-community/Qwen2.5-VL-3B-Instruct-4bit - Qwen vision variant

πŸ“Š Understanding Quantization

4-bit quantization: Best balance of quality and size (recommended for 64GB systems)

  • Roughly 0.5 GB of weights per billion parameters (~3.5-4 GB for a 7B model)
  • Minimal quality loss for most tasks

8-bit quantization: Higher quality, larger size

  • Roughly 1 GB of weights per billion parameters (~7 GB for a 7B model)
  • Better for tasks requiring high precision

BF16/FP16: Full precision

  • Maximum quality but 4x larger than 4-bit
  • Only for smaller models on 64GB systems
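
The sizes above follow directly from parameters × bits per weight / 8, plus runtime overhead. A rough sketch (the 15% overhead factor is an assumption, not a measured value):

```shell
# Rough RAM estimate in GB: billions of params * bits per weight / 8,
# plus ~15% for the KV cache and runtime overhead (rough assumption).
estimate_gb() {
  awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b / 8 * 1.15 }'
}

estimate_gb 7 4    # 7B at 4-bit  -> about 4 GB
estimate_gb 32 4   # 32B at 4-bit -> about 18 GB
estimate_gb 70 4   # 70B at 4-bit -> about 40 GB
```

These line up with the catalog sizes above (~4 GB for a 7B, ~18 GB for a 32B, 40 GB for a 70B at 4-bit).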

πŸ” Finding More Models

Official Sources:

  • Most models above are published by the mlx-community and lmstudio-community organizations on Hugging Face; browse their model collections by family.

Tips:

  • Look for -4bit or -8bit in model names for quantized versions
  • Models with -MLX suffix are optimized for Apple Silicon
  • Check model cards for memory requirements and benchmarks
  • MoE (Mixture of Experts) models can run larger parameter counts efficiently
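
The naming conventions above can be checked mechanically; quant_of here is a hypothetical helper, not part of this repository:

```shell
# Hypothetical helper: infer the quantization level from a model name suffix.
quant_of() {
  case "$1" in
    *-4bit*|*-4-bit*) echo "4-bit" ;;
    *-8bit*|*-8-bit*) echo "8-bit" ;;
    *)                echo "unknown (check the model card)" ;;
  esac
}

quant_of "mlx-community/Qwen2.5-Coder-32B-Instruct-4bit"                    # 4-bit
quant_of "lmstudio-community/Mistral-Small-3.2-24B-Instruct-2506-MLX-8bit"  # 8-bit
```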

⚑ Performance Tips for 64GB Systems

  1. Use 4-bit quantization for the best balance of performance and quality
  2. For coding tasks: Qwen2.5-Coder-32B-4bit offers the best results
  3. For reasoning: DeepSeek-R1-Distill-Qwen-32B-4bit or QwQ-32B-Preview-4bit
  4. For general use: Llama-3.3-70B-Instruct-4bit (if you have the memory headroom) or Mistral-Small-24B
  5. Run multiple models: You can run 2-3 smaller models (7B-8B) simultaneously
  6. Use a persistent chat session (e.g. mlx_lm.chat) with larger models to keep them loaded in memory between conversations
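
Tying these tips back to the run script, a hypothetical helper (not part of this repository) that maps a task to the model suggested above:

```shell
# Hypothetical helper: pick one of the models recommended above by task.
model_for() {
  case "$1" in
    coding)    echo "mlx-community/Qwen2.5-Coder-32B-Instruct-4bit" ;;
    reasoning) echo "mlx-community/DeepSeek-R1-Distill-Qwen-32B-4bit" ;;
    general)   echo "mlx-community/Llama-3.3-70B-Instruct-4bit" ;;
    *)         echo "mlx-community/Meta-Llama-3.1-8B-Instruct-4bit" ;;
  esac
}

model_for coding    # prints mlx-community/Qwen2.5-Coder-32B-Instruct-4bit
# Then launch it:
# ./run_mlx_openwebui.sh "$(model_for coding)"
```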

Stopping Services

  1. Stop MLX Server: Close the terminal window where the uv run mlx_lm.server... command is running, or press Ctrl+C in that window.

  2. Stop Open WebUI: Run the following command in the project directory (using the modern docker compose syntax):

    docker compose down
