This project sets up a local environment for running Large Language Models (LLMs) with the MLX framework and interacting with them through the Open WebUI interface.

Prerequisites:
- Docker and Docker Compose
- Flox
- uv (will be installed via Flox)
1. Install Flox: If you haven't already, install Flox by following the instructions here.

2. Activate the Flox Environment: Navigate to the project directory (`llm-local-server`) in your terminal and activate the Flox environment. This will install `uv` and other necessary tools defined in the `flox.nix` file (if one exists):

   ```sh
   cd llm-local-server
   flox activate
   ```

3. Install Python Dependencies: Use `uv` to sync the Python dependencies listed in `requirements.txt` or `pyproject.toml`:

   ```sh
   uv sync
   ```

   (Note: make sure you have a `requirements.txt` or `pyproject.toml` file with `mlx-lm` listed.)
4. Make the Script Executable: Grant execution permissions to the run script:

   ```sh
   chmod +x run_mlx_openwebui.sh
   ```

5. Execute the Run Script: Run the script, passing the name of the MLX model you want to load as an argument. Replace `<your_model_name>` with the actual model identifier (e.g., `mlx-community/Mistral-7B-Instruct-v0.2`):

   ```sh
   ./run_mlx_openwebui.sh <your_model_name>
   ```

   This script will:
   - Start the Open WebUI container in the background using `docker compose`. The `docker-compose.yml` file maps host port 3000 to the container's port 8080 and uses `host.docker.internal` so that the container can communicate back to the MLX server running on your host machine.
   - Open a new terminal window and start the MLX LM server, loading the specified model and listening on port 8000.
6. Access Open WebUI: Open your web browser and navigate to `http://localhost:3000` (as mapped in `docker-compose.yml`).

7. Configure Open WebUI:
   - In Open WebUI, go to Settings -> Connections.
   - Set the API Base URL to `http://host.docker.internal:8000/v1`. (Open WebUI, running inside Docker, uses this URL to reach the MLX server on your host machine, thanks to the `extra_hosts` setting in `docker-compose.yml`.)
   - You should now be able to select and interact with your locally running model.
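The dependency file mentioned in the setup steps might look like the following minimal `pyproject.toml`. This is a hedged sketch: the project name, version, and Python floor are placeholders, and only the `mlx-lm` dependency is actually required by the steps above.

```toml
[project]
name = "llm-local-server"      # placeholder; match your project
version = "0.1.0"              # placeholder
requires-python = ">=3.10"     # assumption; check mlx-lm's supported versions
dependencies = [
    "mlx-lm",                  # provides the mlx_lm.server command used by the run script
]
```

With this file in place, `uv sync` resolves and installs the dependencies into the project environment.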
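The `docker-compose.yml` described above (host port 3000 mapped to container port 8080, `extra_hosts` so the container can reach the host) might look roughly like this. The service name, image tag, and environment variable are assumptions; adapt them to your actual file.

```yaml
services:
  open-webui:
    image: ghcr.io/open-webui/open-webui:main   # assumed image tag
    ports:
      - "3000:8080"                             # host 3000 -> container 8080
    extra_hosts:
      # Lets the container resolve host.docker.internal back to the host,
      # where the MLX server listens on port 8000.
      - "host.docker.internal:host-gateway"
    environment:
      # Assumed; the same base URL can instead be set in the Open WebUI settings UI.
      - OPENAI_API_BASE_URL=http://host.docker.internal:8000/v1
```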
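The MLX LM server exposes an OpenAI-compatible API, which is why the base URL above ends in `/v1`. As a sanity check from the host (outside Docker), you can build a request against the `/chat/completions` endpoint with only the standard library; this is an illustrative sketch, and the model name passed in must match whatever model the server actually loaded.

```python
import json
import urllib.request

MLX_BASE_URL = "http://localhost:8000/v1"   # the server started by the run script

def chat_request(model: str, prompt: str,
                 base_url: str = MLX_BASE_URL) -> urllib.request.Request:
    """Build a POST for the OpenAI-compatible /chat/completions endpoint."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 64,
    }).encode()
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )

# With the server running, send it like this:
#   with urllib.request.urlopen(chat_request("mlx-community/Mistral-7B-Instruct-v0.2", "Hi")) as r:
#       print(json.load(r)["choices"][0]["message"]["content"])
```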
This comprehensive list includes models optimized for Apple Silicon using the MLX framework. All models are compatible with mlx-lm and have been tested on MacBook Pro systems with 64GB unified memory.
- Small Models (1-8B): 3-10 GB RAM (4-bit quantized)
- Medium Models (8-14B): 10-20 GB RAM (4-bit quantized)
- Large Models (14-32B): 20-35 GB RAM (4-bit quantized)
- Extra Large Models (32-70B): 35-50 GB RAM (4-bit quantized)
With 64GB RAM, you can run models up to 70B parameters (4-bit quantized) or run multiple smaller models simultaneously.
For general use and reasoning:
- `mlx-community/Llama-3.3-70B-Instruct-4bit` (40GB) - GPT-4 class performance
- `mlx-community/DeepSeek-R1-Distill-Qwen-32B-4bit` (18.5GB) - Excellent reasoning capabilities

For coding:
- `mlx-community/Qwen2.5-Coder-32B-Instruct-4bit` (~18GB) - Top coding performance, 92 languages
- `mlx-community/DeepSeek-Coder-V2-Instruct-4bit` - Specialized for code generation
- `mlx-community/Qwen3-Coder-Flash` (30.5B MoE) - Fast coding with MoE architecture

For a smaller footprint:
- `lmstudio-community/Mistral-Small-3.2-24B-Instruct-2506-MLX-4bit` (~12GB) - GPT-4 quality
- `mlx-community/Qwen3-8B-4bit` - Excellent general performance, compact size
- `mlx-community/Meta-Llama-3.1-8B-Instruct-4bit` - Reliable and fast
Llama 3.3 (70B)
- `mlx-community/Llama-3.3-70B-Instruct-4bit` (40GB)

Llama 3.2 (Vision-capable)
- `mlx-community/Llama-3.2-11B-Vision-Instruct-4bit` (~7GB) - Multimodal
- `mlx-community/Llama-3.2-90B-Vision-Instruct-4bit` (~50GB) - Large multimodal

Llama 3.1
- `mlx-community/Meta-Llama-3.1-8B-Instruct-4bit` (~5GB)
- `mlx-community/Meta-Llama-3.1-70B-Instruct-4bit` (~40GB)

Llama 3
- `mlx-community/Meta-Llama-3-8B-Instruct-4bit` (~5GB)
- `mlx-community/Meta-Llama-3-70B-Instruct-4bit` (~40GB)

Qwen 3 (Latest)
- `mlx-community/Qwen3-235B-A22B-8bit` - Massive MoE model
- `mlx-community/Qwen3-8B-4bit` (~5GB)
- `mlx-community/Qwen3-4B-4bit` (~3GB)
- `mlx-community/Qwen3-1.5B-4bit` (~2GB)

Qwen 3 Coder
- `mlx-community/Qwen3-Coder-Flash` (30.5B MoE, 3.3B active)
- `mlx-community/Qwen3-Coder-32B-4bit`

Qwen 2.5
- `mlx-community/Qwen2.5-72B-Instruct-4bit` (~40GB)
- `mlx-community/Qwen2.5-32B-Instruct-4bit` (~18GB)
- `mlx-community/Qwen2.5-14B-Instruct-4bit` (~8GB)
- `mlx-community/Qwen2.5-7B-Instruct-4bit` (~4GB)

Qwen 2.5 Coder (Specialized for Coding)
- `mlx-community/Qwen2.5-Coder-32B-Instruct-4bit` (~18GB)
- `mlx-community/Qwen2.5-Coder-14B-Instruct-4bit` (~8GB)
- `mlx-community/Qwen2.5-Coder-7B-Instruct-4bit` (~4GB)

Qwen 2.5 Vision
- `mlx-community/Qwen2.5-VL-3B-Instruct-4bit` - Multimodal capabilities

QwQ (Reasoning)
- `mlx-community/QwQ-32B-Preview-4bit` (18.5GB, 32K context)

DeepSeek R1 (Reasoning Models)
- `lmstudio-community/DeepSeek-R1-0528-Qwen3-8B-MLX-4bit`
- `mlx-community/DeepSeek-R1-Distill-Qwen-32B-4bit` (18.5GB)
- `mlx-community/DeepSeek-R1-Distill-Qwen-14B-4bit`
- `mlx-community/DeepSeek-R1-Distill-Qwen-1.5B-4bit` (both 4-bit and 8-bit)

DeepSeek V3
- `mlx-community/DeepSeek-V3.1-4bit`

DeepSeek Coder
- `mlx-community/DeepSeek-Coder-V2-Instruct-4bit` - Models range from 1.3B to 33B parameters

Mistral Small
- `mlx-community/Mistral-Small-24B-Instruct-2501-4bit` (~12GB)
- `lmstudio-community/Mistral-Small-3.2-24B-Instruct-2506-MLX-4bit`
- `lmstudio-community/Mistral-Small-3.2-24B-Instruct-2506-MLX-8bit`
- `mlx-community/Mistral-Small-Instruct-2409-4bit`

Mistral 7B
- `mlx-community/Mistral-7B-Instruct-v0.3-4bit` (default in `run_mlx_openwebui.sh`)
- `mlx-community/Mistral-7B-Instruct-v0.2-4-bit`
- `mlx-community/Mistral-7B-v0.1-hf-4bit-mlx`

Pixtral (Vision)
- `mlx-community/pixtral-12b-4bit` (7.15GB) - Multimodal, 128K context

Phi-4 (14B - Latest)
- `mlx-community/phi-4-4bit` (~8GB)
- `mlx-community/phi-4-8bit` (~14GB)
- `lmstudio-community/Phi-4-reasoning-MLX-4bit`
- `lmstudio-community/Phi-4-mini-reasoning-MLX-4bit`

Phi-3
- `mlx-community/Phi-3-mini-4k-instruct-4bit` (~2GB)
- `mlx-community/Phi-3-mini-4k-instruct-8bit` (~4GB)

Gemma 3
- `mlx-community/gemma-3n-E4B-it-lm-4bit` (MoE model)
- `lmstudio-community/gemma-3n-E4B-it-MLX-4bit`
- `lmstudio-community/gemma-3n-E4B-it-MLX-8bit`
- `mlx-community/gemma-3-4b-it-8bit`
- `mlx-community/gemma-3-4b-pt-4bit`
- `mlx-community/gemma-3-1b-it-4bit`

StableLM
- `mlx-community/stablelm-2-zephyr-1_6b-4bit` (~1GB)

H2O Danube
- `ucheog/h2o-danube2-1.8b-chat-MLX-4bit` (~1GB)

DBRX
- Quantized versions available (requires 64GB+ RAM)
These models can process both images and text:
- `mlx-community/Llama-3.2-11B-Vision-Instruct-4bit` (~7GB)
- `mlx-community/Llama-3.2-90B-Vision-Instruct-4bit` (~50GB)
- `mlx-community/pixtral-12b-4bit` (7.15GB) - Mistral's vision model
- `mlx-community/Qwen2.5-VL-3B-Instruct-4bit` - Qwen vision variant
4-bit quantization: Best balance of quality and size (recommended for 64GB systems)
- Typically uses ~2.5GB per 7B parameters
- Minimal quality loss for most tasks
8-bit quantization: Higher quality, larger size
- Typically uses ~5GB per 7B parameters
- Better for tasks requiring high precision
BF16/FP16: Full precision
- Maximum quality but 4x larger than 4-bit
- Only for smaller models on 64GB systems
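The rules of thumb above (~2.5 GB per 7B parameters at 4-bit, ~5 GB at 8-bit, and roughly 4x the 4-bit size at full precision) can be turned into a quick back-of-the-envelope calculator. This is an illustrative sketch of the stated rule, not a measurement: actual memory use also depends on context length, KV cache, and runtime overhead, and tends to come in higher than this estimate.

```python
def estimate_ram_gb(params_billion: float, bits: int = 4) -> float:
    """Rough weight-memory estimate from the rules of thumb above:
    ~2.5 GB per 7B parameters at 4-bit, scaling linearly with bit width.
    Ignores KV cache and runtime overhead."""
    gb_per_7b_at_4bit = 2.5
    return params_billion / 7 * gb_per_7b_at_4bit * (bits / 4)

print(f"7B  @ 4-bit: ~{estimate_ram_gb(7):.1f} GB")    # → ~2.5 GB
print(f"7B  @ 8-bit: ~{estimate_ram_gb(7, 8):.1f} GB") # → ~5.0 GB
print(f"70B @ 4-bit: ~{estimate_ram_gb(70):.1f} GB")   # → ~25.0 GB
```

Comparing the last estimate with the listed 40GB for `Llama-3.3-70B-Instruct-4bit` shows why these numbers should be treated as a lower bound when planning against the 64GB budget.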
Official Sources:
- mlx-community on Hugging Face - 1,000+ models
- MLX models filter
- lmstudio-community - Additional MLX conversions
Collections by Model Family:
- Qwen 3 Collection
- Qwen 2.5 Collection
- Qwen 2.5 Coder Collection
- Qwen 3 Coder MoE Collection
- Mistral Collection
- Phi-3 Collection
- Gemma 3 Collection
Tips:
- Look for `-4bit` or `-8bit` in model names for quantized versions
- Models with an `-MLX` suffix are optimized for Apple Silicon
- Check model cards for memory requirements and benchmarks
- MoE (Mixture of Experts) models can run larger parameter counts efficiently
- Use 4-bit quantization for the best balance of performance and quality
- For coding tasks: Qwen2.5-Coder-32B-4bit offers the best results
- For reasoning: DeepSeek-R1-Distill-Qwen-32B-4bit or QwQ-32B-Preview-4bit
- For general use: Llama-3.3-70B-Instruct-4bit (if you have memory) or Mistral-Small-24B
- Run multiple models: You can run 2-3 smaller models (7B-8B) simultaneously
- Use `llm chat` for larger models to keep them in memory between conversations
1. Stop the MLX Server: Close the terminal window where the `uv run mlx_lm.server ...` command is running, or press `Ctrl+C` in that window.

2. Stop Open WebUI: Run the following command in the project directory (using the modern `docker compose` syntax):

   ```sh
   docker compose down
   ```