<!--
# Copyright 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
# are met:
#  * Redistributions of source code must retain the above copyright
#    notice, this list of conditions and the following disclaimer.
#  * Redistributions in binary form must reproduce the above copyright
#    notice, this list of conditions and the following disclaimer in the
#    documentation and/or other materials provided with the distribution.
#  * Neither the name of NVIDIA CORPORATION nor the names of its
#    contributors may be used to endorse or promote products derived
#    from this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
-->

# Deploying the DeepSeek-R1-Distill-Llama-8B model with Triton

In this tutorial we'll use the vLLM Backend to deploy
[`DeepSeek-R1-Distill-Llama-8B`](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Llama-8B).
Read more about vLLM [here](https://blog.vllm.ai/2023/06/20/vllm.html) and
the vLLM Backend [here](https://github.com/triton-inference-server/vllm_backend).

## Model Repository

Let's first set up a model repository. In this tutorial we'll use the sample
model repository provided in the [Triton vLLM backend repository](https://github.com/triton-inference-server/vllm_backend/tree/main/samples/model_repository/vllm_model).

You can clone the full repository with:
```bash
git clone -b r25.01 https://github.com/triton-inference-server/vllm_backend.git
```
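
The configuration we'll edit lives inside the sample repository at the path
below (as of the `r25.01` branch; verify against your checkout):
```bash
# Inspect the default model configuration shipped with the sample repository
cat vllm_backend/samples/model_repository/vllm_model/1/model.json
```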

The sample model repository uses the [`facebook/opt-125m` model](https://github.com/triton-inference-server/vllm_backend/blob/80dd0371e0301fabf79c57536e60700d016fcc76/samples/model_repository/vllm_model/1/model.json#L2);
let's replace it with `"deepseek-ai/DeepSeek-R1-Distill-Llama-8B"`.
Additionally, note that with all other parameters at their defaults,
`"deepseek-ai/DeepSeek-R1-Distill-Llama-8B"` needs about 35 GB of GPU memory
to be deployed via Triton with the vLLM backend, so it's important to adjust
`gpu_memory_utilization` to your hardware. For example, on an RTX 5880 the
minimum value is `0.69`, while `0.41` is sufficient on an A100. For the
simplicity of this tutorial, we'll set this value to `0.9`. The resulting
`model.json` should look like:
```json
{
    "model": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
    "disable_log_requests": true,
    "gpu_memory_utilization": 0.9,
    "enforce_eager": true
}
```
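
If you prefer to script the edit, here is a minimal sketch; the second `sed`
assumes the sample file keeps the key layout shown above, so verify the result
with `cat` afterwards:
```bash
MODEL_JSON=vllm_backend/samples/model_repository/vllm_model/1/model.json
# Point the sample configuration at the DeepSeek distill model
sed -i 's|facebook/opt-125m|deepseek-ai/DeepSeek-R1-Distill-Llama-8B|' "$MODEL_JSON"
# Raise gpu_memory_utilization to 0.9 (adjust for your GPU)
sed -i 's|"gpu_memory_utilization": *[0-9.]*|"gpu_memory_utilization": 0.9|' "$MODEL_JSON"
```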

## Serving with Triton

You can then launch `tritonserver` as usual:
```bash
LOCAL_MODEL_REPOSITORY=./vllm_backend/samples/model_repository/
docker run --rm -it --net host --shm-size=2g --ulimit memlock=-1 \
    --ulimit stack=67108864 --gpus all -v $LOCAL_MODEL_REPOSITORY:/opt/tritonserver/model_repository \
    nvcr.io/nvidia/tritonserver:25.01-vllm-python-py3 tritonserver --model-repository=model_repository/
```
The server has launched successfully when you see the following output in your console:

```
I0922 23:28:40.351809 1 grpc_server.cc:2451] Started GRPCInferenceService at 0.0.0.0:8001
I0922 23:28:40.352017 1 http_server.cc:3558] Started HTTPService at 0.0.0.0:8000
I0922 23:28:40.395611 1 http_server.cc:187] Started Metrics Service at 0.0.0.0:8002
```
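
Before sending inference requests, you can also confirm that both the server
and the model are ready via Triton's standard HTTP health endpoints:
```bash
# Returns HTTP 200 once the server is ready to accept requests
curl -v localhost:8000/v2/health/ready
# Returns HTTP 200 once vllm_model is loaded and ready for inference
curl -v localhost:8000/v2/models/vllm_model/ready
```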

## Sending requests via the `generate` endpoint

As a quick check that the server works, you can send a request to the `generate` endpoint. Read more about the `generate` endpoint [here](https://github.com/triton-inference-server/server/blob/main/docs/protocol/extension_generate.md).

```bash
$ curl -X POST localhost:8000/v2/models/vllm_model/generate -d '{"text_input": "What is Triton Inference Server?", "parameters": {"stream": false, "temperature": 0, "exclude_input_in_output": true, "max_tokens": 45}}' | jq
```
The expected output should look like:
```json
{
  "model_name": "vllm_model",
  "model_version": "1",
  "text_output": " It's a high-performance, scalable, and efficient inference server for AI models. It's designed to handle large numbers of requests quickly and efficiently, making it suitable for real-time applications like autonomous vehicles, smart homes, and more"
}
```
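
The generate extension also has a streaming variant, `generate_stream`, which
returns results as Server-Sent Events. Here is the same request in streaming
mode:
```bash
curl -X POST localhost:8000/v2/models/vllm_model/generate_stream \
    -d '{"text_input": "What is Triton Inference Server?", "parameters": {"stream": true, "temperature": 0, "exclude_input_in_output": true, "max_tokens": 45}}'
```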

## Sending requests via the Triton client

The Triton vLLM Backend repository has a [samples folder](https://github.com/triton-inference-server/vllm_backend/tree/main/samples)
with an example `client.py` that you can use to test the model.

```bash
LOCAL_WORKSPACE=./vllm_backend/samples
docker run -ti --gpus all --network=host --pid=host --ipc=host \
    -v $LOCAL_WORKSPACE:/workspace nvcr.io/nvidia/tritonserver:25.01-py3-sdk
```
Then, inside the container, you can run the client as follows:
```bash
python client.py -m vllm_model
```
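
The sample client reads a small set of default prompts (the four shown below)
and writes the generations to `results.txt`. If your copy of `client.py`
exposes prompt-file and results-file flags (check `python client.py --help`;
the flag names below are assumptions based on the samples client at the time
of writing), you can also point it at your own prompts:
```bash
# --input-prompts / --results-file are assumed flag names; verify with --help
python client.py -m vllm_model --input-prompts prompts.txt --results-file results.txt
```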

The preceding steps should result in a `results.txt` with the following content:
```
Hello, my name is
I need to write a program that can read a text file and find all the names in the text. The names can be in any case (uppercase, lowercase, or mixed). Also, the names can be part of longer words or phrases, so I need to make sure that I'm extracting only the names and not parts of other words. Additionally, the names can be separated by various non-word characters, such as commas, periods, apostrophes, etc. So, I need to extract

=========

The most dangerous animal is
The most dangerous animal is the one that poses the greatest threat to human safety and well-being. This can vary depending on the region and the specific circumstances. For example, in some areas, large predators like lions or tigers might be considered the most dangerous, while in others, venomous snakes or dangerous marine animals might take precedence.

To determine the most dangerous animal, one would need to consider factors such as:
1. **Number of incidents**: How many people have been injured or killed by this

=========

The capital of France is
A) London
B) Paris
C) Marseille
D) Lyon

Okay, so I have this question here: "The capital of France is..." with options A) London, B) Paris, C) Marseille, D) Lyon. Hmm, I need to figure out the correct answer. Let me think about what I know regarding the capitals of different countries.

First off, I remember that France is a country in Western Europe. I've heard people talk about Paris before, especially in

=========

The future of AI is
AI is the future of everything. It's going to change how we live, work, and interact with the world. From healthcare to education, from transportation to entertainment, AI will play a crucial role in shaping our tomorrow. But what does that mean for us? How will AI impact our daily lives? Let's explore some possibilities.

First, in healthcare, AI can help diagnose diseases faster and more accurately than ever before. It can analyze medical data, recommend treatments, and even assist in surgery.

=========
```

0 commit comments

Comments
 (0)