Download Model from Hugging Face Hub
source .venv/bin/activate
# Download the model from Hugging Face Hub
python model_download.py --repo_id "mlx-community/gemma-2-9b-it-4bit"
# Download the model from Hugging Face Hub with a custom cache directory
python model_download.py --repo_id "mlx-community/gemma-2-9b-it-4bit" --cache_dir "/tmp/huggingface/hub"
# Download the model from Hugging Face Hub with a custom Hugging Face token
python model_download.py --repo_id "mlx-community/gemma-2-9b-it-4bit" --token "YOUR_HUGGING_FACE_API_TOKEN"
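For reference, below is a minimal sketch of what such a download script could look like, assuming it wraps huggingface_hub's snapshot_download and exposes the flags above via argparse; the actual model_download.py may be implemented differently.

```python
# Hypothetical sketch of a downloader built on huggingface_hub; the actual
# model_download.py in this repo may be implemented differently.
import argparse

from huggingface_hub import snapshot_download

parser = argparse.ArgumentParser(description="Download a model from the Hugging Face Hub")
parser.add_argument("--repo_id", type=str, required=True, help="Hub repository id, e.g. mlx-community/gemma-2-9b-it-4bit")
parser.add_argument("--cache_dir", type=str, default=None, help="Custom cache directory (optional)")
parser.add_argument("--token", type=str, default=None, help="Hugging Face API token (optional)")
args = parser.parse_args()

# snapshot_download fetches all files in the repo and returns the local snapshot path
local_path = snapshot_download(repo_id=args.repo_id, cache_dir=args.cache_dir, token=args.token)
print(f"Model downloaded to: {local_path}")
```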
Streaming Inference
| Args | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| -m, --model | str | Required | | Path to the model |
| --prompt | str | Required | | Prompt for the LLM model |
| --max_tokens | int | Optional | 512 | Maximum tokens to generate |
| --verbose | bool | Optional | | Verbose mode |
source .venv/bin/activate
# Run the stream inference with default values
python inference.py
# Run the stream inference with verbose mode
python inference.py --verbose
# Run the stream inference with custom model
python inference.py --model "mlx-community/DeepSeek-Coder-V2-Lite-Instruct-4bit-mlx"
# Run the stream inference with a custom prompt
python inference.py --prompt "What is the capital of France?"
# Run the stream inference with custom max tokens
python inference.py --max_tokens 1024
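A minimal sketch of what the streaming script could look like, assuming it uses mlx_lm's load and stream_generate and mirrors the --model, --prompt, and --max_tokens arguments from the table above (the --verbose flag is omitted here); the actual inference.py may be structured differently.

```python
# Hypothetical sketch of streaming inference with mlx_lm; the actual
# inference.py in this repo may be structured differently.
import argparse

from mlx_lm import load, stream_generate

parser = argparse.ArgumentParser(description="Stream tokens from an MLX model")
parser.add_argument("-m", "--model", type=str, required=True, help="Path to the model")
parser.add_argument("--prompt", type=str, required=True, help="Prompt for the LLM model")
parser.add_argument("--max_tokens", type=int, default=512, help="Maximum tokens to generate")
args = parser.parse_args()

# load() returns the model weights and the matching tokenizer
model, tokenizer = load(args.model)

# stream_generate yields output incrementally; depending on the mlx_lm version
# each chunk is either a plain string or a response object with a .text field
for chunk in stream_generate(model, tokenizer, args.prompt, max_tokens=args.max_tokens):
    print(getattr(chunk, "text", chunk), end="", flush=True)
print()
```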
Convert Hugging Face Model to MLX Model Format
| Args | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| -m, --model | str | Required | | Path to the model |
| --quantize | bool | Required | | Whether to quantize the model |
| --quantize_level | int | Optional | 4 | Quantization level (bits) |
| --verbose | bool | Optional | | Verbose mode |
source .venv/bin/activate
# Convert Hugging Face model to MLX model format
python convert.py --model "google/gemma-2-9b-it"
# Convert Hugging Face model to MLX model format with verbose mode
python convert.py --model "google/gemma-2-9b-it" --verbose
# Convert Hugging Face model to MLX model format with quantization
python convert.py --model "google/gemma-2-9b-it" --quantize
# Convert Hugging Face model to MLX model format with quantization and custom quantize level
python convert.py --model "google/gemma-2-9b-it" --quantize --quantize_level 8
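A minimal sketch of what the conversion script could look like, assuming it wraps mlx_lm's convert helper and maps --quantize_level to its q_bits parameter (the --verbose flag is omitted here); the actual convert.py may differ.

```python
# Hypothetical sketch of Hugging Face -> MLX conversion via mlx_lm's convert
# helper; the actual convert.py in this repo may differ.
import argparse

# Depending on the mlx_lm version, convert may need to be imported from
# mlx_lm.convert instead of the package root.
from mlx_lm import convert

parser = argparse.ArgumentParser(description="Convert a Hugging Face model to MLX format")
parser.add_argument("-m", "--model", type=str, required=True, help="Path to the model on the Hugging Face Hub")
parser.add_argument("--quantize", action="store_true", help="Whether to quantize the model")
parser.add_argument("--quantize_level", type=int, default=4, help="Quantization level (bits)")
args = parser.parse_args()

# convert() downloads the Hugging Face weights, optionally quantizes them,
# and writes an MLX-format model directory ("mlx_model" by default)
convert(hf_path=args.model, quantize=args.quantize, q_bits=args.quantize_level)
```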
About
LLM model inference on Apple Silicon Mac using the Apple MLX Framework.