Binary file added src/images/blogs/DeepSeek.png
17 changes: 14 additions & 3 deletions src/routes/blogs/+page.svelte
@@ -23,6 +23,7 @@
import OliveCli from '../../images/blogs/olive-flow.png';
import QuantizeFinetune from '../../images/blogs/Quantize-finetune.jpg';
import MultiLoraThumbnail from '../../images/blogs/multilora.png';
import DeepSeekR1Thumbnail from '../../images/blogs/DeepSeek.png';
import ORTLogo from '../../images/ONNX-Icon.svg';
onMount(() => {
anime({
@@ -51,6 +52,15 @@
dispatch('switchTab', tab);
}
let featuredblog = [
{
title: 'Enhancing DeepSeek R1 performance for on-device inference with ONNX Runtime.',
date: '19th February, 2025',
blurb:
"Enhance your AI inferencing performance with DeepSeek R1 optimized for on-device use via ONNX Runtime! This blog explores how to efficiently run DeepSeek models across NPUs, GPUs, and CPUs, achieving up to 6.3x speed improvements over PyTorch. Learn how to convert, quantize, and fine-tune these models using the Olive framework and Azure AI Foundry.",
link: 'blogs/deepseek-r1-on-device',
image: DeepSeekR1Thumbnail,
imgalt: 'DeepSeek R1 On Device using ONNX Runtime Gen AI'
},
{
title: 'Cross-Platform Edge AI Made Easy with ONNX Runtime',
date: '19th November, 2024',
@@ -69,6 +79,9 @@
image: MultiLoraThumbnail,
imgalt: 'Serving LoRA models separately vs with MultiLoRA'
},

];
let blogs = [
{
title: 'Is it better to quantize before or after finetuning?',
date: '19th November, 2024',
@@ -77,9 +90,7 @@
link: 'blogs/olive-quant-ft',
image: QuantizeFinetune,
imgalt: 'Quantize or finetune first for better model performance?'
}
];
let blogs = [
},
{
title:
'Scribble to Erase on Goodnotes for Windows, Web, and Android, Powered by ONNX Runtime',
84 changes: 84 additions & 0 deletions src/routes/blogs/deepseek-r1-on-device/+page.svx
@@ -0,0 +1,84 @@
---
title: Enhancing DeepSeek R1 performance for on-device inference with ONNX Runtime.
date: '19th February, 2025'
description: 'Boost DeepSeek R1 performance on-device with ONNX Runtime, achieving faster inference across CPU, GPU, and NPU.'
keywords: 'DeepSeek R1 optimization, ONNX Runtime performance, AI inferencing on-device, GPU and CPU model acceleration, Quantizing AI models with Olive, Azure AI Foundry model catalog, ONNX Generative API, AI development best practices, Faster PyTorch alternatives, Model deployment on Copilot+ PCs'
authors: ['Parinita Rahi', 'Sunghoon Choi', 'Kunal Vaishnavi', 'Maanav Dalal']
authorsLink: ['https://www.linkedin.com/in/parinitaparinita/', 'https://www.linkedin.com/in/sunghoon/', 'https://www.linkedin.com/in/kunal-v-16315b94/', 'https://www.linkedin.com/in/maanavdalal/']
image: 'https://iili.io/2yV40bV.png'
imageSquare: 'https://iili.io/2yV40bV.png'
url: 'https://onnxruntime.ai/blogs/deepseek-r1-on-device'
---
Are you a developer looking to harness the power of your users' local compute for AI inferencing on PCs with NPUs, GPUs, and CPUs? Look no further!

Building on the recent ability to run these models on the NPUs of [Copilot+ PCs](https://blogs.windows.com/windowsdeveloper/2025/01/29/running-distilled-deepseek-r1-models-locally-on-copilot-pcs-powered-by-windows-copilot-runtime/), you can now efficiently run them on CPU and GPU devices as well. Download and run the ONNX-optimized variants of the models from [Hugging Face](https://huggingface.co/onnxruntime/DeepSeek-R1-Distill-ONNX).



## Download and run your models easily!
The DeepSeek ONNX models enable you to run DeepSeek on any GPU or CPU, with token generation 1.3x to 6.3x faster than native PyTorch. To get started easily, you can use our ONNX Runtime `Generate()` API.
<!-- Video Embed -->
<div>
<iframe
class="pb-2 w-full"
height="600px"
src="https://www.youtube.com/embed/s63vSd8ZI5g"
title="YouTube video player"
frameborder="0"
allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share"
allowfullscreen
/>
</div>

### Quickstart on CPU


Install onnxruntime-genai, olive-ai, and their dependencies for CPU in a virtual environment:
```bash
python -m venv .venv && source .venv/bin/activate
pip install requests numpy --pre onnxruntime-genai olive-ai
```

Download the model directly using the Hugging Face CLI:
```bash
huggingface-cli download onnxruntime/DeepSeek-R1-Distill-ONNX --include "deepseek-r1-distill-qwen-1.5B/*" --local-dir ./
```
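
If you prefer to stay in Python, the same files can also be fetched with `huggingface_hub` (the package behind the CLI). A minimal sketch, assuming `huggingface_hub` is available in your environment:

```python
# Sketch: download the same ONNX model files with huggingface_hub instead of the CLI.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="onnxruntime/DeepSeek-R1-Distill-ONNX",
    allow_patterns=["deepseek-r1-distill-qwen-1.5B/*"],  # same subset as the CLI command above
    local_dir="./",
)
```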

Run chat inference on CPU. If you downloaded the model to a different location, adjust the model directory (`-m`) accordingly:
```bash
wget https://raw.githubusercontent.com/microsoft/onnxruntime-genai/refs/heads/main/examples/python/model-chat.py
python model-chat.py -m deepseek-r1-distill-qwen-1.5B/cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4 -e cpu
```
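
The `model-chat.py` script is a convenient wrapper, but you can also drive generation yourself with the `Generate()` API. Below is a minimal sketch, assuming the model directory downloaded above and a plain text prompt (`model-chat.py` additionally applies the model's chat template for you):

```python
# Sketch: token-by-token generation with the ONNX Runtime Generate() API on CPU.
import onnxruntime_genai as og

model_dir = "deepseek-r1-distill-qwen-1.5B/cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4"
model = og.Model(model_dir)
tokenizer = og.Tokenizer(model)
stream = tokenizer.create_stream()

params = og.GeneratorParams(model)
params.set_search_options(max_length=1024)

generator = og.Generator(model, params)
generator.append_tokens(tokenizer.encode("Why is the sky blue?"))

# Stream tokens to stdout as they are generated.
while not generator.is_done():
    generator.generate_next_token()
    print(stream.decode(generator.get_next_tokens()[0]), end="", flush=True)
print()
```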

See instructions for GPU (CUDA, DML) [here](https://github.com/microsoft/onnxruntime/blob/gh-pages/docs/genai/tutorials/deepseek-python.md).
## ONNX Model Performance Improvements

ONNX enables you to run your models on-device across CPU, GPU, and NPU, on silicon from Qualcomm, AMD, Intel, and NVIDIA. See the table below for key benchmarks on Windows GPU and CPU devices.

| Model | Precision | Device Type | Execution Provider | Device | Token Generation Throughput (tokens/sec) | Speedup vs. PyTorch |
| ----------------------------------------- | ------------------ | ----------- | ------------------ | -------- | ----------------------------------------- | ------------------- |
| deepseek-ai_DeepSeek-R1-Distill-Qwen-1.5B | ONNX fp16 | GPU | CUDA | RTX 4090 | 197.195 | 4X |
| deepseek-ai_DeepSeek-R1-Distill-Qwen-1.5B | ONNX Int4 | GPU | CUDA | RTX 4090 | 313.32 | 6.3X |
| deepseek-ai_DeepSeek-R1-Distill-Qwen-7B | ONNX fp16 | GPU | CUDA | RTX 4090 | 57.316 | 1.3X |
| deepseek-ai_DeepSeek-R1-Distill-Qwen-7B | ONNX Int4 | GPU | CUDA | RTX 4090 | 161.00 | 3.7X |
| deepseek-ai_DeepSeek-R1-Distill-Qwen-7B | ONNX Int4/bfloat16 | CPU | CPU | Intel i9 | 3.184 | 20X |
| deepseek-ai_DeepSeek-R1-Distill-Qwen-1.5B | ONNX Int4 | CPU | CPU | Intel i9 | 11.749 | 1.4X |

_CUDA build specs: onnxruntime-genai-cuda==0.6.0-dev, transformers==4.46.2, onnxruntime-gpu==1.20.1_ <br/>
_CPU build specs: onnxruntime-genai==0.6.0-dev, transformers==4.46.2, onnxruntime==1.20.1_

## Easily fine-tune your models with Olive

This [notebook](https://github.com/microsoft/Olive/blob/main/examples/getting_started/olive-deepseek-finetune.ipynb) provides a step-by-step guide to fine-tuning DeepSeek models using the Olive framework. It covers the process of setting up your environment, preparing your data, and leveraging Azure AI Foundry to optimize and deploy your models. The notebook is designed to help you get started quickly and efficiently with DeepSeek and Olive, making your AI development process smoother and more effective.
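
If you prefer scripting over the notebook, Olive workflows can also be launched from Python. The sketch below is hypothetical: the config file name is a placeholder for a fine-tuning configuration like the one the notebook builds.

```python
# Hypothetical sketch: launch an Olive workflow (e.g. a DeepSeek fine-tuning config) from Python.
from olive.workflows import run as olive_run

# "deepseek_finetune_config.json" is a placeholder; see the linked notebook for a real configuration.
olive_run("deepseek_finetune_config.json")
```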


## Conclusion

Optimizing DeepSeek R1 distilled models with ONNX Runtime can lead to significant performance improvements. These optimized models are coming soon via Azure AI Foundry and can be easily accessed via the command line or the [VS Code AI Toolkit](https://code.visualstudio.com/docs/intelligentapps/overview).

By leveraging Azure AI Foundry, the AI Toolkit, Olive, and ONNX Runtime together, you get an end-to-end solution for your model development experience. Stay tuned for more updates and best practices on enhancing AI model performance.
<style>
a {
text-decoration: underline;
}
</style>