
Commit 06e91ac

MaanavD authored and Ted Themistokleous committed
Uploaded deepseek blog, ready for post. (microsoft#23740)
Preview link available at: https://maanavd.github.io/onnxruntime/blogs
1 parent 65f7d2d commit 06e91ac


4 files changed, +100 -4 lines changed


docs/genai/tutorials/deepseek-python.md

Lines changed: 2 additions & 1 deletion
@@ -53,4 +53,5 @@ python model-chat.py -m deepseek-r1-distill-qwen-1.5B/model -e cpu --chat_templa
 ```bash
 # On-Device GPU Chat inference. Works on devices with Nvidia GPUs. If you pulled the model from huggingface, adjust the model directory (-m) accordingly
 curl -O https://raw.githubusercontent.com/microsoft/onnxruntime-genai/refs/heads/main/examples/python/model-chat.py
-python model-chat.py -m deepseek-r1-distill-qwen-1.5B/model -e cuda --chat_template "<|begin▁of▁sentence|><|User|>{input}<|Assistant|>"
+python model-chat.py -m deepseek-r1-distill-qwen-1.5B/model -e cuda --chat_template "<|begin▁of▁sentence|><|User|>{input}<|Assistant|>"
+```

src/images/blogs/DeepSeek.png

77.1 KB

src/routes/blogs/+page.svelte

Lines changed: 14 additions & 3 deletions
@@ -23,6 +23,7 @@
 	import OliveCli from '../../images/blogs/olive-flow.png';
 	import QuantizeFinetune from '../../images/blogs/Quantize-finetune.jpg';
 	import MultiLoraThumbnail from '../../images/blogs/multilora.png';
+	import DeepSeekR1Thumbnail from '../../images/blogs/DeepSeek.png';
 	import ORTLogo from '../../images/ONNX-Icon.svg';
 	onMount(() => {
 		anime({
@@ -51,6 +52,15 @@
 		dispatch('switchTab', tab);
 	}
 	let featuredblog = [
+		{
+			title: 'Enhancing DeepSeek R1 performance for on-device inference with ONNX Runtime.',
+			date: '19th February, 2025',
+			blurb:
+				"Enhance your AI inferencing performance with DeepSeek R1 optimized for on-device use via ONNX Runtime! This blog explores how to efficiently run DeepSeek models across NPUs, GPUs, and CPUs, achieving up to 6.3x speed improvements over PyTorch. Learn how to convert, quantize, and fine-tune these models using the Olive framework and Azure AI Foundry.",
+			link: 'blogs/deepseek-r1-on-device',
+			image: DeepSeekR1Thumbnail,
+			imgalt: 'DeepSeek R1 On Device using ONNX Runtime Gen AI'
+		},
 		{
 			title: 'Cross-Platform Edge AI Made Easy with ONNX Runtime',
 			date: '19th November, 2024',
@@ -69,6 +79,9 @@
 			image: MultiLoraThumbnail,
 			imgalt: 'Serving LoRA models separately vs with MultiLoRA'
 		},
+
+	];
+	let blogs = [
 		{
 			title: 'Is it better to quantize before or after finetuning?',
 			date: '19th November, 2024',
@@ -77,9 +90,7 @@
 			link: 'blogs/olive-quant-ft',
 			image: QuantizeFinetune,
 			imgalt: 'Quantize or finetune first for better model performance?'
-		}
-	];
-	let blogs = [
+		},
 		{
 			title:
 				'Scribble to Erase on Goodnotes for Windows, Web, and Android, Powered by ONNX Runtime',
Lines changed: 84 additions & 0 deletions
@@ -0,0 +1,84 @@
---
title: Enhancing DeepSeek R1 performance for on-device inference with ONNX Runtime.
date: '19th February, 2025'
description: 'Boost DeepSeek R1 performance on-device with ONNX Runtime, achieving faster inference across CPU, GPU, and NPU.'
keywords: 'DeepSeek R1 optimization, ONNX Runtime performance, AI inferencing on-device, GPU and CPU model acceleration, Quantizing AI models with Olive, Azure AI Foundry model catalog, ONNX Generative API, AI development best practices, Faster PyTorch alternatives, Model deployment on Copilot+ PCs'
authors: ['Parinita Rahi', 'Sunghoon Choi', 'Kunal Vaishnavi', 'Maanav Dalal']
authorsLink: ['https://www.linkedin.com/in/parinitaparinita/', 'https://www.linkedin.com/in/sunghoon/', 'https://www.linkedin.com/in/kunal-v-16315b94/', 'https://www.linkedin.com/in/maanavdalal/']
image: 'https://iili.io/2yV40bV.png'
imageSquare: 'https://iili.io/2yV40bV.png'
url: 'https://onnxruntime.ai/blogs/deepseek-r1-on-device'
---
Are you a developer looking to harness the power of your users' local compute for AI inferencing on PCs with NPUs, GPUs, and CPUs? Look no further!

With this release you can run these models on CPU and GPU: download the ONNX-optimized variants directly from [Hugging Face](https://huggingface.co/onnxruntime/DeepSeek-R1-Distill-ONNX). You can also run these models on NPU; see the [Windows Developer Blog](https://blogs.windows.com/windowsdeveloper/2025/01/29/running-distilled-deepseek-r1-models-locally-on-copilot-pcs-powered-by-windows-copilot-runtime/) for details.

## Download and run your models easily!

The DeepSeek ONNX models enable you to run DeepSeek on any GPU or CPU, with inference 1.3 to 6.3 times faster than native PyTorch. To get started with the model easily, you can use our ONNX Runtime `Generate()` API.
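
For a sense of what that looks like in code, here is a minimal sketch of a single-turn generation loop following the pattern of the `model-chat.py` example used later in this post. Treat it as a sketch rather than a drop-in script: the model path points at the CPU int4 variant from the quickstart below, and the exact API surface can differ between onnxruntime-genai releases.

```python
# Minimal sketch: single-prompt generation with the ONNX Runtime Generate() API.
# Assumptions: a recent onnxruntime-genai release is installed and the DeepSeek
# ONNX model has been downloaded locally (the path below is illustrative).
import onnxruntime_genai as og

model = og.Model("deepseek-r1-distill-qwen-1.5B/cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4")
tokenizer = og.Tokenizer(model)
stream = tokenizer.create_stream()

# DeepSeek R1 distilled models expect this chat template around the user prompt.
prompt = "<|begin▁of▁sentence|><|User|>Why is the sky blue?<|Assistant|>"

params = og.GeneratorParams(model)
params.set_search_options(max_length=2048)

generator = og.Generator(model, params)
generator.append_tokens(tokenizer.encode(prompt))

# Stream tokens to stdout as they are generated.
while not generator.is_done():
    generator.generate_next_token()
    print(stream.decode(generator.get_next_tokens()[0]), end="", flush=True)
print()
```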

<!-- Video Embed -->
<div>
  <iframe
    class="pb-2 w-full"
    height="600px"
    src="https://www.youtube.com/embed/s63vSd8ZI5g"
    title="YouTube video player"
    frameborder="0"
    allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share"
    allowfullscreen
  />
</div>

### Quickstart on CPU

Install onnxruntime-genai and its dependencies for CPU in a virtual environment:
```bash
python -m venv .venv && source .venv/bin/activate
pip install requests numpy --pre onnxruntime-genai
```

Download the model directly using the Hugging Face CLI:
```bash
huggingface-cli download onnxruntime/DeepSeek-R1-Distill-ONNX --include "deepseek-r1-distill-qwen-1.5B/*" --local-dir ./
```

Run CPU chat inference (if you pulled the model from Hugging Face, adjust the model directory (-m) accordingly):
```bash
wget https://raw.githubusercontent.com/microsoft/onnxruntime-genai/refs/heads/main/examples/python/model-chat.py
python model-chat.py -m deepseek-r1-distill-qwen-1.5B/cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4 -e cpu
```

See instructions for GPU (CUDA, DML) [here](https://github.com/microsoft/onnxruntime/blob/gh-pages/docs/genai/tutorials/deepseek-python.md).

## ONNX Model Performance Improvements

ONNX enables you to run your models on-device across CPU, GPU, and NPU, on any machine and across silicon from Qualcomm, AMD, Intel, and Nvidia. See the table below for some key benchmarks on Windows GPU and CPU devices; a short sketch of how such throughput numbers can be measured follows the build specs.

| Model | Precision | Execution Provider | Device | Token Generation Throughput (tokens/sec) | Speedup vs. PyTorch |
| ----------------------------------------- | --------- | ------------------ | ----------------- | ---------------------------------------- | ------------------- |
| deepseek-ai_DeepSeek-R1-Distill-Qwen-1.5B | fp16 | CUDA | RTX 4090 | 197.195 | 4x |
| deepseek-ai_DeepSeek-R1-Distill-Qwen-1.5B | int4 | CUDA | RTX 4090 | 313.32 | 6.3x |
| deepseek-ai_DeepSeek-R1-Distill-Qwen-7B | fp16 | CUDA | RTX 4090 | 57.316 | 1.3x |
| deepseek-ai_DeepSeek-R1-Distill-Qwen-7B | int4 | CUDA | RTX 4090 | 161.00 | 3.7x |
| deepseek-ai_DeepSeek-R1-Distill-Qwen-7B | int4 | CPU | 13th Gen Intel i9 | 3.184 | 20x |
| deepseek-ai_DeepSeek-R1-Distill-Qwen-1.5B | int4 | CPU | 13th Gen Intel i9 | 11.749 | 1.4x |

_CUDA BUILD SPECS: onnxruntime-genai-cuda==0.6.0, transformers==4.46.2, onnxruntime-gpu==1.20.1_ <br/>
_CPU BUILD SPECS: onnxruntime-genai==0.6.0, transformers==4.46.2, onnxruntime==1.20.1_
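
For context on the metric, the sketch below times the decode loop from the earlier snippet and reports tokens per second. It omits warm-up runs, prompt-processing time, and averaging over repeated prompts, so it illustrates the measurement rather than reproducing the table; the model path and prompt are illustrative.

```python
# Rough sketch of measuring token generation throughput (tokens/sec) with
# onnxruntime-genai. A real benchmark should add warm-up runs and average
# over many prompts and generation lengths.
import time
import onnxruntime_genai as og

model = og.Model("deepseek-r1-distill-qwen-1.5B/cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4")
tokenizer = og.Tokenizer(model)

params = og.GeneratorParams(model)
params.set_search_options(max_length=512)

generator = og.Generator(model, params)
generator.append_tokens(tokenizer.encode("<|begin▁of▁sentence|><|User|>Explain ONNX in one paragraph.<|Assistant|>"))

generated = 0
start = time.perf_counter()
while not generator.is_done():
    generator.generate_next_token()
    generated += 1
elapsed = time.perf_counter() - start

print(f"Generated {generated} tokens in {elapsed:.2f}s ({generated / elapsed:.1f} tokens/sec)")
```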

## Easily fine-tune your models with Olive

This [notebook](https://github.com/microsoft/Olive/blob/main/examples/getting_started/olive-deepseek-finetune.ipynb) provides a step-by-step guide to fine-tuning DeepSeek models using the Olive framework. It covers setting up your environment, preparing your data, and leveraging Azure AI Foundry to optimize and deploy your models. The notebook is designed to help you get started quickly and efficiently with DeepSeek and Olive, making your AI development process smoother and more effective.
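
If you prefer to drive Olive from a script rather than the notebook, a workflow can also be launched from Python. The sketch below is only illustrative: the config file name is a placeholder, and the passes it would contain (fine-tuning, ONNX conversion, quantization) are the ones the notebook assembles, so refer to the notebook for a working configuration.

```python
# Minimal sketch: running an Olive workflow from Python.
# Assumes Olive is installed (pip install olive-ai) and that a workflow config
# like the one assembled in the linked notebook has been saved locally.
# "deepseek_finetune_config.json" is a placeholder name, not a file shipped with Olive.
from olive.workflows import run as olive_run

olive_run("deepseek_finetune_config.json")
```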

## Conclusion

Optimizing DeepSeek R1 distilled models with ONNX Runtime can lead to significant performance improvements. These optimized models are coming soon to Azure AI Foundry and will be easily accessible via the command line or the [VS Code AI Toolkit](https://code.visualstudio.com/docs/intelligentapps/overview).

By combining Azure AI Foundry, the AI Toolkit, Olive, and ONNX Runtime, you get an end-to-end solution for your model development experience. Stay tuned for more updates and best practices on enhancing AI model performance.

<style>
  a {
    text-decoration: underline;
  }
</style>
