|
| 1 | +{ |
| 2 | + "cells": [ |
| 3 | + { |
| 4 | + "cell_type": "markdown", |
| 5 | + "id": "7d551647-dfc2-47da-bc8a-3792af622073", |
| 6 | + "metadata": {}, |
| 7 | + "source": [ |
| 8 | + "# vLLM Module with MLRun\n", |
| 9 | + "\n", |
| 10 | + "This notebook shows how to configure and deploy a vLLM OpenAI compatible server as an MLRun application runtime, then showcases how to send a chat request to it to the vLLM server." |
| 11 | + ] |
| 12 | + }, |
| 13 | + { |
| 14 | + "cell_type": "code", |
| 15 | + "execution_count": 1, |
| 16 | + "id": "7707b270-30cc-448a-a828-cb93aa28030d", |
| 17 | + "metadata": {}, |
| 18 | + "outputs": [], |
| 19 | + "source": [ |
| 20 | + "import mlrun\n", |
| 21 | + ] |
| 22 | + }, |
| 23 | + { |
| 24 | + "cell_type": "markdown", |
| 25 | + "id": "d5cff681-bfdf-4468-a1d1-2aeadb56065e", |
| 26 | + "metadata": {}, |
| 27 | + "source": [ |
| 28 | + "## Prerequisite\n", |
| 29 | + "* At lease one GPU is required for running this notebook." |
| 30 | + ] |
| 31 | + }, |
| 32 | + { |
| 33 | + "cell_type": "markdown", |
| 34 | + "id": "d5c84798-289f-4b4f-8c1b-f4dd12a3bda5", |
| 35 | + "metadata": {}, |
| 36 | + "source": [ |
| 37 | + "## What this notebook does\n", |
| 38 | + "\n", |
| 39 | + "In this notebook we will:\n", |
| 40 | + "\n", |
| 41 | + "- Create or load an **MLRun project**\n", |
| 42 | + "- Import a custom **vLLM module** from the MLRun Hub\n", |
| 43 | + "- Deploy a **vLLM OpenAI-compatible server** as an MLRun application runtime\n", |
| 44 | + "- Configure deployment parameters such as model, GPU count, memory, node selector, port, and log level\n", |
| 45 | + "- Invoke the deployed service using the `/v1/chat/completions` endpoint\n", |
| 46 | + "- Parse the response and extract only the assistant’s generated text\n", |
| 47 | + "\n", |
| 48 | + "By the end of this notebook, you will have a working vLLM deployment that can be queried directly from a Jupyter notebook using OpenAI-style APIs.\n", |
| 49 | + "\n", |
| 50 | + "For more information about [vLLM documentation](https://docs.vllm.ai/en/latest/serving/openai_compatible_server/)" |
| 51 | + ] |
| 52 | + }, |
| 53 | + { |
| 54 | + "cell_type": "markdown", |
| 55 | + "id": "879ca641-ee35-4682-9995-4eb319d89090", |
| 56 | + "metadata": {}, |
| 57 | + "source": [ |
| 58 | + "## 1. Create an MLRun project\n", |
| 59 | + "\n", |
| 60 | + "In this section we create or load an MLRun project that will own the deployed vLLM application runtime." |
| 61 | + ] |
| 62 | + }, |
| 63 | + { |
| 64 | + "cell_type": "code", |
| 65 | + "execution_count": null, |
| 66 | + "id": "6eac263a-17d1-4454-9e19-459dfbe2f231", |
| 67 | + "metadata": {}, |
| 68 | + "outputs": [], |
| 69 | + "source": [ |
| 70 | + "project = mlrun.get_or_create_project(name=\"vllm-module\", context=\"\", user_project=True)" |
| 71 | + ] |
| 72 | + }, |
| 73 | + { |
| 74 | + "cell_type": "markdown", |
| 75 | + "id": "da49d335-b704-4fb6-801f-4d07b64f9be6", |
| 76 | + "metadata": {}, |
| 77 | + "source": [ |
| 78 | + "## 2. Import the vLLM module from the MLRun Hub\n", |
| 79 | + "\n", |
| 80 | + "In this section we import the vLLM module from the MLRun Hub so we can instantiate `VLLMModule` and deploy it as an application runtime." |
| 81 | + ] |
| 82 | + }, |
| 83 | + { |
| 84 | + "cell_type": "code", |
| 85 | + "execution_count": null, |
| 86 | + "id": "e6d89dee-db58-4c0c-8009-b37020c9599a", |
| 87 | + "metadata": {}, |
| 88 | + "outputs": [], |
| 89 | + "source": [ |
| 90 | + "vllm = mlrun.import_module(\"hub://vllm-module\")" |
| 91 | + ] |
| 92 | + }, |
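|  | +   { |
|  | +    "cell_type": "markdown", |
|  | +    "id": "inspect-module-params", |
|  | +    "metadata": {}, |
|  | +    "source": [ |
|  | +     "Before configuring the deployment, you can inspect the imported class to see which parameters it accepts. This is a quick sketch using Python's built-in `help`; the exact argument list depends on the hub module version." |
|  | +    ] |
|  | +   }, |
|  | +   { |
|  | +    "cell_type": "code", |
|  | +    "execution_count": null, |
|  | +    "id": "inspect-module-params-code", |
|  | +    "metadata": {}, |
|  | +    "outputs": [], |
|  | +    "source": [ |
|  | +     "# Optional: list the constructor parameters of the imported class.\n", |
|  | +     "# The available arguments depend on the hub module version.\n", |
|  | +     "help(vllm.VLLMModule)" |
|  | +    ] |
|  | +   }, |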
| 93 | + { |
| 94 | + "cell_type": "markdown", |
| 95 | + "id": "1202ddd5-0ce7-4769-be29-8fc264c1f80e", |
| 96 | + "metadata": {}, |
| 97 | + "source": [ |
| 98 | + "## 3. Deploy the vLLM application runtime\n", |
| 99 | + "\n", |
| 100 | + "Configure the vLLM deployment parameters and deploy the application.\n", |
| 101 | + "\n", |
| 102 | + "The returned address is the service URL for the application runtime." |
| 103 | + ] |
| 104 | + }, |
| 105 | + { |
| 106 | + "cell_type": "code", |
| 107 | + "execution_count": null, |
| 108 | + "id": "e433123a-e64b-4a7a-8c7f-8165bcdcc6d1", |
| 109 | + "metadata": {}, |
| 110 | + "outputs": [], |
| 111 | + "source": [ |
| 112 | + "# Initialize the vLLM app\n", |
| 113 | + "vllm_module = vllm.VLLMModule(\n", |
| 114 | + " project=project,\n", |
| 115 | + " node_selector={\"alpha.eksctl.io/nodegroup-name\": \"added-gpu\"},\n", |
| 116 | + " name=\"qwen-vllm\",\n", |
| 117 | + " image=\"vllm/vllm-openai:latest\",\n", |
| 118 | + " model=\"Qwen/Qwen2.5-Omni-3B\",\n", |
| 119 | + " gpus=1,\n", |
| 120 | + " mem=\"10G\",\n", |
| 121 | + " port=8000,\n", |
| 122 | + " dtype=\"auto\",\n", |
| 123 | + " uvicorn_log_level=\"info\",\n", |
| 124 | + " max_tokens = 501,\n", |
| 125 | + ")\n", |
| 126 | + "\n", |
| 127 | + "# Deploy the vLLM app\n", |
| 128 | + "addr = vllm_module.vllm_app.deploy(with_mlrun=True)\n", |
| 129 | + "addr" |
| 130 | + ] |
| 131 | + }, |
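|  | +   { |
|  | +    "cell_type": "markdown", |
|  | +    "id": "verify-deployment", |
|  | +    "metadata": {}, |
|  | +    "source": [ |
|  | +     "Before sending chat requests, it helps to verify that the server is up. The OpenAI-compatible API lists the served models on `GET /v1/models`. The sketch below assumes the returned `addr` is reachable from the notebook environment." |
|  | +    ] |
|  | +   }, |
|  | +   { |
|  | +    "cell_type": "code", |
|  | +    "execution_count": null, |
|  | +    "id": "verify-deployment-code", |
|  | +    "metadata": {}, |
|  | +    "outputs": [], |
|  | +    "source": [ |
|  | +     "import requests\n", |
|  | +     "\n", |
|  | +     "# Sanity check (a sketch, assuming `addr` is reachable from this notebook):\n", |
|  | +     "# the OpenAI-compatible server lists the served models on GET /v1/models.\n", |
|  | +     "models = requests.get(f\"{addr}/v1/models\", timeout=30).json()\n", |
|  | +     "print([m[\"id\"] for m in models[\"data\"]])" |
|  | +    ] |
|  | +   }, |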
| 132 | + { |
| 133 | + "cell_type": "markdown", |
| 134 | + "id": "06832de3-5c31-43bf-b07b-0e71fb2d072d", |
| 135 | + "metadata": {}, |
| 136 | + "source": [ |
| 137 | + "## 4. Get the runtime handle\n", |
| 138 | + "\n", |
| 139 | + "Fetch the runtime object and invoke the service using `app.invoke(...)`." |
| 140 | + ] |
| 141 | + }, |
| 142 | + { |
| 143 | + "cell_type": "code", |
| 144 | + "execution_count": null, |
| 145 | + "id": "102d3fd0-1ee6-49b8-8c86-df742ac1c559", |
| 146 | + "metadata": {}, |
| 147 | + "outputs": [], |
| 148 | + "source": [ |
| 149 | + "# Optional: get_runtime() method uses to get the MLRun application runtime\n", |
| 150 | + "app = vllm_module.get_runtime()" |
| 151 | + ] |
| 152 | + }, |
| 153 | + { |
| 154 | + "cell_type": "markdown", |
| 155 | + "id": "925730c1-0ac5-454b-8fb2-ab8cebb3f3ac", |
| 156 | + "metadata": {}, |
| 157 | + "source": [ |
| 158 | + "## 5. Send a chat request for testing\n", |
| 159 | + "\n", |
| 160 | + "Call the OpenAI compatible endpoint `/v1/chat/completions`, parse the JSON response, and print only the assistant message text." |
| 161 | + ] |
| 162 | + }, |
| 163 | + { |
| 164 | + "cell_type": "code", |
| 165 | + "execution_count": 28, |
| 166 | + "id": "31bc78d4-1c6f-439c-b894-1522e3a6d3e6", |
| 167 | + "metadata": {}, |
| 168 | + "outputs": [], |
| 169 | + "source": [ |
| 170 | + "body = {\n", |
| 171 | + " \"model\": vllm_module.model,\n", |
| 172 | + " \"messages\": [{\"role\": \"user\", \"content\": \"what are the 3 countries with the most gpu as far as you know\"}],\n", |
| 173 | + " \"max_tokens\": vllm_module.max_tokens, # start smaller for testing\n", |
| 174 | + "}\n", |
| 175 | + "\n", |
| 176 | + "resp = app.invoke(path=\"/v1/chat/completions\", body=body)" |
| 177 | + ] |
| 178 | + }, |
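|  | +   { |
|  | +    "cell_type": "markdown", |
|  | +    "id": "sampling-params-note", |
|  | +    "metadata": {}, |
|  | +    "source": [ |
|  | +     "The endpoint also accepts standard OpenAI sampling parameters such as `temperature` and `top_p`. A sketch with illustrative values:" |
|  | +    ] |
|  | +   }, |
|  | +   { |
|  | +    "cell_type": "code", |
|  | +    "execution_count": null, |
|  | +    "id": "sampling-params-code", |
|  | +    "metadata": {}, |
|  | +    "outputs": [], |
|  | +    "source": [ |
|  | +     "# Optional: the OpenAI-compatible API also accepts sampling parameters\n", |
|  | +     "# such as temperature and top_p (the values here are illustrative).\n", |
|  | +     "body_sampled = {\n", |
|  | +     "    \"model\": vllm_module.model,\n", |
|  | +     "    \"messages\": [{\"role\": \"user\", \"content\": \"Give one fun fact about GPUs.\"}],\n", |
|  | +     "    \"max_tokens\": 128,\n", |
|  | +     "    \"temperature\": 0.7,\n", |
|  | +     "    \"top_p\": 0.9,\n", |
|  | +     "}\n", |
|  | +     "resp_sampled = app.invoke(path=\"/v1/chat/completions\", body=body_sampled)" |
|  | +    ] |
|  | +   }, |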
| 179 | + { |
| 180 | + "cell_type": "code", |
| 181 | + "execution_count": 29, |
| 182 | + "id": "a459d5f8-dad0-4735-94c2-3801d4f94bb5", |
| 183 | + "metadata": {}, |
| 184 | + "outputs": [ |
| 185 | + { |
| 186 | + "name": "stdout", |
| 187 | + "output_type": "stream", |
| 188 | + "text": [ |
| 189 | + "Raw response keys: dict_keys(['id', 'object', 'created', 'model', 'choices', 'service_tier', 'system_fingerprint', 'usage', 'prompt_logprobs', 'prompt_token_ids', 'kv_transfer_params'])\n", |
| 190 | + "\n", |
| 191 | + "assistant:\n", |
| 192 | + "\n", |
| 193 | + "As of July 2023, the three countries with the most scientists in articles that explain or discuss GPU contributions to AI are the United States, China, and India.\n", |
| 194 | + "The number of scientists is not the best measure of the number of GPUs. According to Practical Deep Learning forמות, China has 6,307 GPU-equipped tens of thousands of compute servers, the number of GPUs in the top 100 supercomputers is 6,492 (19th largest: 10,363), and the per capita number of GPUs invested by research institutions is high. From the total value perspective, the annual procurement increase of GPUs in China is estimated to be more than $40 billion. Similarly, the United States and India have significantly higher prices than China purely due to price controls.\n", |
| 195 | + "In summary, there is limited data to support the claim that GPU prices vary significantly between the three countries. However, China has a significant number of GPUs in use, and its computational resources are some of the largest in the world.\n" |
| 196 | + ] |
| 197 | + } |
| 198 | + ], |
| 199 | + "source": [ |
| 200 | + "data = resp.json()\n", |
| 201 | + "assistant_text = data[\"choices\"][0][\"message\"][\"content\"]\n", |
| 202 | + "\n", |
| 203 | + "print(\"\\nassistant:\\n\")\n", |
| 204 | + "print(assistant_text.strip())" |
| 205 | + ] |
| 206 | + }, |
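|  | +   { |
|  | +    "cell_type": "markdown", |
|  | +    "id": "openai-client-note", |
|  | +    "metadata": {}, |
|  | +    "source": [ |
|  | +     "Because the server speaks the OpenAI API, you can also query it with the official `openai` Python client instead of `app.invoke`. This is a minimal sketch, assuming the `openai` package is installed in the notebook environment; a dummy API key works because no key was configured on the server." |
|  | +    ] |
|  | +   }, |
|  | +   { |
|  | +    "cell_type": "code", |
|  | +    "execution_count": null, |
|  | +    "id": "openai-client-code", |
|  | +    "metadata": {}, |
|  | +    "outputs": [], |
|  | +    "source": [ |
|  | +     "# A sketch using the official OpenAI client against the deployed server\n", |
|  | +     "# (assumes the `openai` package is installed; the API key is a dummy\n", |
|  | +     "# value because no key was configured at deployment time).\n", |
|  | +     "from openai import OpenAI\n", |
|  | +     "\n", |
|  | +     "client = OpenAI(base_url=f\"{addr}/v1\", api_key=\"EMPTY\")\n", |
|  | +     "completion = client.chat.completions.create(\n", |
|  | +     "    model=vllm_module.model,\n", |
|  | +     "    messages=[{\"role\": \"user\", \"content\": \"Say hello in one short sentence.\"}],\n", |
|  | +     "    max_tokens=64,\n", |
|  | +     ")\n", |
|  | +     "print(completion.choices[0].message.content)" |
|  | +    ] |
|  | +   } |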
| 215 | + ], |
| 216 | + "metadata": { |
| 217 | + "kernelspec": { |
| 218 | + "display_name": "mlrun-base", |
| 219 | + "language": "python", |
| 220 | + "name": "conda-env-mlrun-base-py" |
| 221 | + }, |
| 222 | + "language_info": { |
| 223 | + "codemirror_mode": { |
| 224 | + "name": "ipython", |
| 225 | + "version": 3 |
| 226 | + }, |
| 227 | + "file_extension": ".py", |
| 228 | + "mimetype": "text/x-python", |
| 229 | + "name": "python", |
| 230 | + "nbconvert_exporter": "python", |
| 231 | + "pygments_lexer": "ipython3", |
| 232 | + "version": "3.9.22" |
| 233 | + } |
| 234 | + }, |
| 235 | + "nbformat": 4, |
| 236 | + "nbformat_minor": 5 |
| 237 | +} |