|
| 1 | +# Multi XPU (Step-3.5-Flash) |
| 2 | + |
| 3 | +## Run vllm-kunlun0.15.1-dev on Multi XPU |
| 4 | + |
| 5 | +Set up the environment using a container: |
| 6 | + |
| 7 | +```bash |
| 8 | +#!/bin/bash |
| 9 | +# rundocker.sh |
| 10 | +XPU_NUM=8 |
| 11 | +DOCKER_DEVICE_CONFIG="" |
| 12 | +if [ $XPU_NUM -gt 0 ]; then |
| 13 | + for idx in $(seq 0 $((XPU_NUM-1))); do |
| 14 | + DOCKER_DEVICE_CONFIG="${DOCKER_DEVICE_CONFIG} --device=/dev/xpu${idx}:/dev/xpu${idx}" |
| 15 | + done |
| 16 | + DOCKER_DEVICE_CONFIG="${DOCKER_DEVICE_CONFIG} --device=/dev/xpuctrl:/dev/xpuctrl" |
| 17 | +fi |
| 18 | + |
| 19 | +export build_image="xxxxxxxxxxxxxxxxx" |
| 20 | + |
| 21 | +docker run -itd ${DOCKER_DEVICE_CONFIG} \ |
| 22 | + --net=host \ |
| 23 | + --cap-add=SYS_PTRACE --security-opt seccomp=unconfined \ |
| 24 | + --tmpfs /dev/shm:rw,nosuid,nodev,exec,size=32g \ |
| 26 | + -v /home/users/vllm-kunlun:/home/vllm-kunlun \ |
| 27 | + -v /usr/local/bin/xpu-smi:/usr/local/bin/xpu-smi \ |
| 28 | + --name "$1" \ |
| 29 | + -w /workspace \ |
| 30 | + "$build_image" /bin/bash |
| 31 | +``` |
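
The device-mapping loop in `rundocker.sh` can be mirrored in Python when the container launch is driven from another tool. The following is a minimal sketch, assuming the same `/dev/xpuN` and `/dev/xpuctrl` device names used in the script above; `build_device_flags` is a hypothetical helper and not part of vllm-kunlun.

```python
def build_device_flags(xpu_num: int) -> list[str]:
    """Return the docker --device flags for xpu_num XPUs plus the control device."""
    if xpu_num <= 0:
        # Matches the shell script: no devices are mapped when XPU_NUM is 0.
        return []
    flags = [f"--device=/dev/xpu{idx}:/dev/xpu{idx}" for idx in range(xpu_num)]
    # The control device is mapped once, after the per-card devices.
    flags.append("--device=/dev/xpuctrl:/dev/xpuctrl")
    return flags


if __name__ == "__main__":
    print(" ".join(build_device_flags(8)))
```

The returned list can be spliced into a `docker run` argument list, which avoids the word-splitting that the unquoted `${DOCKER_DEVICE_CONFIG}` expansion relies on in the shell version.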
| 32 | + |
| 33 | +### Offline Inference on Multi XPU |
| 34 | + |
| 35 | +Run the offline inference script in the container: |
| 36 | + |
| 37 | +```python |
| 38 | +# Export these environment variables in the shell before running the script: |
| 39 | +# unset XPU_DUMMY_EVENT |
| 40 | +# export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 |
| 41 | +# export XFT_USE_FAST_SWIGLU=1 # use the fast SwiGLU implementation |
| 42 | +# export XPU_USE_FAST_SWIGLU=1 # use the fast SwiGLU implementation in the MoE operator |
| 43 | +# export XMLIR_CUDNN_ENABLED=1 |
| 44 | +# export XPU_USE_DEFAULT_CTX=1 |
| 45 | +# export XMLIR_FORCE_USE_XPU_GRAPH=1 |
| 46 | +# export XPU_USE_MOE_SORTED_THRES=128 |
| 47 | +# export VLLM_HOST_IP=127.0.0.1 |
| 48 | +# export XMLIR_ENABLE_MOCK_TORCH_COMPILE=false |
| 49 | +# export VLLM_USE_V1=1 |
| 50 | +# export USE_ORI_ROPE=1 |
| 51 | +# export KUNLUN_DISABLE_SMALL_MOE=1 # temporary fix for Step-3.5-Flash |
| 52 | + |
| 53 | +# Then run: python /workspace/offline.py |
| 54 | + |
| 55 | +from vllm import LLM, SamplingParams |
| 56 | + |
| 57 | +llm = LLM( |
| 58 | + model="/models/Step-3.5-Flash", |
| 59 | + tensor_parallel_size=8, |
| 60 | + dtype="bfloat16", |
| 61 | + max_model_len=32768, |
| 62 | + gpu_memory_utilization=0.9, |
| 63 | + trust_remote_code=True, |
| 64 | + distributed_executor_backend="mp", |
| 65 | + block_size=128, |
| 66 | + max_num_seqs=128, |
| 67 | + max_num_batched_tokens=32768, |
| 68 | + enable_prefix_caching=False, |
| 69 | + enable_chunked_prefill=False, |
| 70 | +) |
| 71 | + |
| 72 | +sampling_params = SamplingParams( |
| 73 | + temperature=0.7, |
| 74 | + top_p=0.9, |
| 75 | + top_k=10, |
| 76 | + max_tokens=512, |
| 77 | + stop=["<|end|>", "</s>"] |
| 78 | +) |
| 79 | + |
| 80 | +prompt = """ |
| 81 | +<|user|> |
| 82 | +你好,请介绍一下你自己 |
| 83 | +<|assistant|> |
| 84 | +""" |
| 85 | + |
| 86 | +outputs = llm.generate([prompt], sampling_params) |
| 87 | +print(outputs[0].outputs[0].text) |
| 88 | +``` |
| 89 | + |
| 91 | +If the script runs successfully, you should see output similar to the following: |
| 92 | + |
| 93 | +```bash |
| 94 | +================================================== |
| 95 | +你好!我是 **Step**,由 **阶跃星辰(StepFun)** 开发的多模态大语言模型。 |
| 96 | +我具备自然语言理解与生成、图像分析、视觉推理、数理逻辑、知识问答等多种能力。不仅能理解和处理文字信息,还能结合图片进行多模态推理与分析。 |
| 97 | + |
| 98 | +我的核心原则是:诚实可靠、有用友善、尊重隐私、促进积极交流、保持客观中立、拒绝有害内容。 |
| 99 | +简单来说,我的目标是为你提供准确、有帮助、温暖的智能支持。 |
| 100 | + |
| 101 | +如果你愿意,可以告诉我你的兴趣或需求,我会尽力帮你实现目标 😊 |
| 102 | +你想先了解我在哪些方面能帮到你吗? |
| 103 | +================================================== |
| 104 | +``` |
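
The environment variables listed as comments at the top of the offline script can alternatively be applied from Python before vLLM is imported. This is a hedged sketch: the variable names and values are copied from this guide, while `KUNLUN_ENV`, `UNSET`, and `apply_kunlun_env` are hypothetical names, and it is an assumption that setting the variables in-process is equivalent to exporting them in the shell.

```python
import os
from collections.abc import MutableMapping

# Values copied from the export list in this guide.
KUNLUN_ENV = {
    "CUDA_VISIBLE_DEVICES": "0,1,2,3,4,5,6,7",
    "XFT_USE_FAST_SWIGLU": "1",
    "XPU_USE_FAST_SWIGLU": "1",
    "XMLIR_CUDNN_ENABLED": "1",
    "XPU_USE_DEFAULT_CTX": "1",
    "XMLIR_FORCE_USE_XPU_GRAPH": "1",
    "XPU_USE_MOE_SORTED_THRES": "128",
    "VLLM_HOST_IP": "127.0.0.1",
    "XMLIR_ENABLE_MOCK_TORCH_COMPILE": "false",
    "VLLM_USE_V1": "1",
    "USE_ORI_ROPE": "1",
    "KUNLUN_DISABLE_SMALL_MOE": "1",  # temporary fix for Step-3.5-Flash
}

# Variables the guide removes with `unset`.
UNSET = ("XPU_DUMMY_EVENT",)


def apply_kunlun_env(env: MutableMapping[str, str] = os.environ) -> None:
    """Remove the unset variables, then write the guide's settings into env."""
    for name in UNSET:
        env.pop(name, None)
    env.update(KUNLUN_ENV)
```

If this approach is used, `apply_kunlun_env()` would be called before `from vllm import LLM`. Exporting the variables in the shell, as the guide does, remains the documented path.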
| 105 | + |
| 106 | +### Online Serving on Multi XPU |
| 107 | + |
| 108 | +Start the vLLM server on multiple XPUs: |
| 109 | + |
| 110 | +```bash |
| 111 | +unset XPU_DUMMY_EVENT |
| 112 | +export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 |
| 113 | +export XFT_USE_FAST_SWIGLU=1 # use the fast SwiGLU implementation |
| 114 | +export XPU_USE_FAST_SWIGLU=1 # use the fast SwiGLU implementation in the MoE operator |
| 115 | +export XMLIR_CUDNN_ENABLED=1 |
| 116 | +export XPU_USE_DEFAULT_CTX=1 |
| 117 | +export XMLIR_FORCE_USE_XPU_GRAPH=1 |
| 118 | +export XPU_USE_MOE_SORTED_THRES=128 |
| 119 | +export VLLM_HOST_IP=127.0.0.1 |
| 120 | +export XMLIR_ENABLE_MOCK_TORCH_COMPILE=false |
| 121 | +export VLLM_USE_V1=1 |
| 122 | +export USE_ORI_ROPE=1 |
| 123 | +export KUNLUN_DISABLE_SMALL_MOE=1 # temporary fix for Step-3.5-Flash |
| 124 | + |
| 125 | +python -m vllm.entrypoints.openai.api_server \ |
| 126 | + --host 0.0.0.0 \ |
| 127 | + --port 8356 \ |
| 128 | + --model /models/Step-3.5-Flash \ |
| 129 | + --gpu-memory-utilization 0.9 \ |
| 130 | + --trust-remote-code \ |
| 131 | + --max-model-len 32768 \ |
| 132 | + --tensor-parallel-size 8 \ |
| 133 | + --dtype bfloat16 \ |
| 134 | + --max-num-seqs 128 \ |
| 135 | + --max-num-batched-tokens 32768 \ |
| 136 | + --block-size 128 \ |
| 137 | + --no-enable-prefix-caching \ |
| 138 | + --no-enable-chunked-prefill \ |
| 139 | + --distributed-executor-backend mp \ |
| 140 | + --served-model-name Step-3.5-Flash \ |
| 141 | + --reasoning-parser step3p5 \ |
| 142 | + --enable-auto-tool-choice \ |
| 143 | + --tool-call-parser step3p5 |
| 144 | +``` |
| 145 | + |
| 146 | +If the service starts successfully, you should see output similar to the following: |
| 147 | + |
| 148 | +```bash |
| 149 | +(APIServer pid=133800) INFO: Started server process [133800] |
| 150 | +(APIServer pid=133800) INFO: Waiting for application startup. |
| 151 | +(APIServer pid=133800) INFO: Application startup complete. |
| 152 | +``` |
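
Loading the model across eight XPUs can take a while, so a script that sends requests may want to wait until the server answers before proceeding. The sketch below polls an HTTP endpoint with the standard library only. `wait_for_server` is a hypothetical helper; the `/health` route mentioned in the usage note is the one exposed by vLLM's OpenAI-compatible server, and it is assumed here that the vllm-kunlun build exposes it unchanged.

```python
import time
import urllib.error
import urllib.request


def wait_for_server(url: str, timeout: float = 600.0, interval: float = 5.0) -> bool:
    """Poll url until it returns HTTP 200 or timeout seconds have elapsed."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=interval) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            # Server not accepting connections yet; retry after a short pause.
            pass
        time.sleep(interval)
    return False
```

For the server started above, the call would be `wait_for_server("http://127.0.0.1:8356/health")`. The default timeout of 600 seconds is an arbitrary choice and may need adjusting for the actual load time.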
| 153 | + |
| 154 | +Once your server is started, you can query the model with input prompts: |
| 155 | + |
| 156 | +```bash |
| 157 | +curl http://127.0.0.1:8356/v1/chat/completions \ |
| 158 | + -H "Content-Type: application/json" \ |
| 159 | + -d '{ |
| 160 | + "model": "Step-3.5-Flash", |
| 161 | + "messages": [ |
| 162 | + {"role": "user", "content": "你好,简单介绍一下你自己"} |
| 163 | + ], |
| 164 | + "max_tokens":200, |
| 165 | + "temperature": 0.7 |
| 166 | + }' |
| 167 | +``` |
| 168 | + |
| 169 | +Or use a Python script: |
| 170 | + |
| 171 | +```python |
| 172 | +import requests |
| 175 | + |
| 176 | +URL = "http://127.0.0.1:8356/v1/chat/completions" |
| 177 | + |
| 178 | +payload = { |
| 179 | + "model": "Step-3.5-Flash", |
| 180 | + "messages": [ |
| 181 | + {"role": "user", "content": "你好,请介绍一下你自己"} |
| 182 | + ], |
| 183 | + "max_tokens": 500, |
| 184 | + "top_p": 0.8, |
| 185 | + "top_k": 10, |
| 186 | + "temperature": 0.7, |
| 187 | + # "presence_penalty": 0.3, |
| 188 | + # "repetition_penalty": 1.05, |
| 189 | + # At present, the model’s responses occasionally suffer from accuracy issues; you may wish to try adjusting the sampling parameters. |
| 190 | + |
| 191 | +} |
| 192 | + |
| 193 | +headers = { |
| 194 | + "Content-Type": "application/json", |
| 195 | + "Authorization": "Bearer EMPTY" |
| 196 | +} |
| 197 | + |
| 198 | +resp = requests.post(URL, headers=headers, json=payload) |
| 199 | +data = resp.json() |
| 200 | + |
| 201 | +choice = data["choices"][0] |
| 202 | +content = choice["message"]["content"] |
| 203 | + |
| 204 | +answer = content |
| 205 | + |
| 206 | +print("\n===== ANSWER =====\n") |
| 207 | +print(answer) |
| 208 | +``` |
| 209 | + |
| 210 | +If the query succeeds, the client should print output similar to the following: |
| 211 | + |
| 212 | +```bash |
| 213 | +{"id":"chatcmpl-93112d4d8e047a9c","object":"chat.completion","created":1776166074,"model":"Step-3.5-Flash","choices":[{"index":0,"message":{"role":"assistant","content":"你好!我是 **Step**,由 **阶跃星辰(StepFun)** 开发的多\n","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning":"好的,用户让我简单介绍一下自己。首先我得明确身份,我是Step,由阶跃星辰(StepFun)开发。用户可能刚接触我,需要基础信息,比如功能、特点以及使用原则。\n\n然后考虑用户的需求场景,可能是第一次使用AI助手,或者想比较不同的AI。需要突出我的多模态能力,比如处理文字和图片,还有逻辑推理、知识问答这些核心功能。同时要强调中文","reasoning_content":"好的,用户让我简单介绍一下自己。首先我得明确身份,我是Step,由阶跃星辰(StepFun)开发。用户可能刚接触我,需要基础信息,比如功能、特点以及使用原则。\n\n然后考虑用户的需求场景,可能是第一次使用AI助手,或者想比较不同的AI。需要突出我的多模态能力,比如处理文字和图片,还有逻辑推理、知识问答这些核心功能。同时要强调中文"},"logprobs":null,"finish_reason":"stop","stop_reason":1,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":17,"total_tokens":131,"completion_tokens":114,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null} |
| 214 | + |
| 215 | +# python script |
| 216 | +===== ANSWER ===== |
| 217 | + |
| 218 | +你好!我是 **Step**,由 **阶跃星辰(StepFun)** 开发的大语言模型。 |
| 219 | + |
| 220 | +我具备以下主要能力和特点: |
| 221 | +- 🧠 **自然语言理解与生成**:能够流畅地进行多轮对话、写作、总结、翻译等; |
| 222 | +- 👁️ **多模态推理**:不仅能处理文字,还能理解和分析图片内容,进行视觉推理; |
| 223 | +- 📚 **知识问答与逻辑推理**:擅长基于事实回答问题,并解决数学、逻辑类任务; |
| 224 | +- 💡 **创意表达**:可辅助创作故事、诗歌、策划方案等富有创意的内容; |
| 225 | +- 🌍 **多语言支持**:能用多种语言与用户交流; |
| 226 | +- 🤝 **安全可靠**:遵循诚实、友善、尊重隐私的原则,保持客观中立。 |
| 227 | + |
| 228 | +我目前是 **完全免费使用** 的,不收集或存储你的个人隐私信息。你可以随时向我提问、讨论、创作或探索各种主题~ |
| 229 | + |
| 230 | +你想先了解我在哪方面最擅长吗? |
| 231 | +``` |
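
The raw JSON response above carries the model's reasoning trace separately from the final answer, in the `reasoning` and `reasoning_content` fields next to `content`. A small sketch for pulling the two apart is shown below. `split_reasoning` is a hypothetical helper, the field names are taken from the sample response in this guide, and the sample dictionary uses placeholder text rather than real model output.

```python
def split_reasoning(response: dict) -> tuple[str, str]:
    """Return (reasoning, answer) from a chat completion response dict."""
    message = response["choices"][0]["message"]
    # The sample response carries the same text in both fields; take the first one present.
    reasoning = message.get("reasoning") or message.get("reasoning_content") or ""
    answer = message.get("content") or ""
    return reasoning, answer


if __name__ == "__main__":
    sample = {
        "choices": [
            {
                "message": {
                    "role": "assistant",
                    "content": "placeholder answer",
                    "reasoning": "placeholder reasoning",
                    "reasoning_content": "placeholder reasoning",
                }
            }
        ]
    }
    reasoning, answer = split_reasoning(sample)
    print("reasoning:", reasoning)
    print("answer:", answer)
```

With the `requests` client shown earlier, `split_reasoning(resp.json())` would give both parts. Note that in the sample response the answer is cut off, because `max_tokens` counts reasoning tokens as well.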
| 232 | + |
| 233 | +Logs of the vLLM server: |
| 234 | + |
| 235 | +```bash |
| 236 | +(APIServer pid=182858) INFO 04-14 19:45:26 [loggers.py:257] Engine 000: Avg prompt throughput: 1.7 tokens/s, Avg generation throughput: 19.3 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0% |
| 237 | +(APIServer pid=182858) INFO: 127.0.0.1:12670 - "POST /v1/chat/completions HTTP/1.1" 200 OK |
| 238 | +(APIServer pid=182858) INFO 04-14 19:45:36 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 24.2 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0% |
| 239 | +(APIServer pid=182858) INFO 04-14 19:45:46 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0% |
| 240 | +``` |