## 1. Prerequisites: Make a virtual environment and install ONNX Runtime GenAI

```bash
# Installing onnxruntime-genai, olive, and dependencies for CPU
python -m venv .venv && source .venv/bin/activate
pip install requests numpy --pre onnxruntime-genai olive-ai
```

```bash
# Installing onnxruntime-genai, olive, and dependencies for CUDA GPU
python -m venv .venv && source .venv/bin/activate
pip install requests numpy --pre onnxruntime-genai-cuda "olive-ai[gpu]"
```

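Either install can be sanity-checked from inside the activated virtual environment. The snippet below only assumes the package imports under its usual module name, `onnxruntime_genai`:

```python
# Quick sanity check (run inside the activated venv): confirms that the
# onnxruntime-genai package imports and reports its version.
import onnxruntime_genai as og

print(og.__version__)
```
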
## 2. Acquire model

Choose your model and convert it to ONNX. Many other LLMs work as well, so feel free to try different models too:

```bash
# Using Olive auto-opt to pull a huggingface model, optimize for CPU, and quantize to INT4 using RTN.
olive auto-opt --model_name_or_path deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B --output_path ./deepseek-r1-distill-qwen-1.5B --device cpu --provider CPUExecutionProvider --precision int4 --use_model_builder --log_level 1
```

```bash
# Using Olive auto-opt to pull a huggingface model, optimize for CUDA GPUs, and quantize to INT4 using RTN.
olive auto-opt --model_name_or_path deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B --output_path ./deepseek-r1-distill-qwen-1.5B --device gpu --provider CUDAExecutionProvider --precision int4 --use_model_builder --log_level 1
```

Or download the pre-converted ONNX model directly using the Hugging Face CLI:

```bash
# Download the model directly using the huggingface cli
huggingface-cli download onnxruntime/DeepSeek-R1-Distill-ONNX --include 'deepseek-r1-distill-qwen-1.5B/*' --local-dir .
```

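Whichever route you take, you can do a quick load check before moving on. This sketch assumes the optimized model landed in `deepseek-r1-distill-qwen-1.5B/model` (the same path the chat example in the next step uses); adjust it if your output directory differs:

```python
# Minimal load check: open the ONNX model and its tokenizer with onnxruntime-genai.
# The model path below is an assumption based on the commands above.
import onnxruntime_genai as og

model = og.Model("deepseek-r1-distill-qwen-1.5B/model")
tokenizer = og.Tokenizer(model)
print("Model and tokenizer loaded successfully")
```
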
## 3. Play with your model on device!

```bash
# CPU chat inference. If you pulled the model from Hugging Face, adjust the model directory (-m) accordingly.
curl -o model-chat.py https://raw.githubusercontent.com/microsoft/onnxruntime-genai/refs/heads/main/examples/python/model-chat.py
python model-chat.py -m deepseek-r1-distill-qwen-1.5B/model -e cpu --chat_template "<|begin▁of▁sentence|><|User|>{input}<|Assistant|>"
```

```bash
# GPU chat inference. If you pulled the model from Hugging Face, adjust the model directory (-m) accordingly.
curl -o model-chat.py https://raw.githubusercontent.com/microsoft/onnxruntime-genai/refs/heads/main/examples/python/model-chat.py
python model-chat.py -m deepseek-r1-distill-qwen-1.5B/model -e cuda --chat_template "<|begin▁of▁sentence|><|User|>{input}<|Assistant|>"
```
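
If you would rather drive the model from your own script instead of `model-chat.py`, a minimal generation loop looks roughly like the sketch below. It follows the pattern used by the onnxruntime-genai Python examples; exact method names can differ between pre-release versions, and the prompt text and search options here are illustrative assumptions:

```python
# Minimal generation loop with onnxruntime-genai (pattern from the Python examples;
# API details may differ slightly between pre-release versions).
import onnxruntime_genai as og

model = og.Model("deepseek-r1-distill-qwen-1.5B/model")  # same folder as above
tokenizer = og.Tokenizer(model)
stream = tokenizer.create_stream()

# DeepSeek-R1 chat template, as used with model-chat.py above; the question is illustrative.
prompt = "<|begin▁of▁sentence|><|User|>Why is the sky blue?<|Assistant|>"

params = og.GeneratorParams(model)
params.set_search_options(max_length=2048)  # illustrative generation limit

generator = og.Generator(model, params)
generator.append_tokens(tokenizer.encode(prompt))

# Stream tokens to stdout as they are generated.
while not generator.is_done():
    generator.generate_next_token()
    token = generator.get_next_tokens()[0]
    print(stream.decode(token), end="", flush=True)
print()
```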